ABSTRACT

While traditional data-management systems focus on evaluating single, ad- 
hoc queries over static data sets in a centralized setting, several emerg- 
ing applications require (possibly, continuous) answers to queries on dy- 
namic data that is widely distributed and constantly updated. Furthermore, 
such query answers often need to discount data that is "stale", and operate 
solely on a sliding window of recent data arrivals (e.g., data updates occur- 
ring over the last 24 hours). Such distributed data streaming applications 
mandate novel algorithmic solutions that are both time- and space-efficient 
(to manage high-speed data streams), and also communication-efficient (to 
deal with physical data distribution). In this paper, we consider the prob- 
lem of complex query answering over distributed, high-dimensional data 
streams in the sliding-window model. We introduce a novel sketching tech- 
nique (termed ECM-sketch) that allows effective summarization of stream- 
ing data over both time-based and count-based sliding windows with prob- 
abilistic accuracy guarantees. Our sketch structure enables point as well 
as inner-product queries, and can be employed to address a broad range 
of problems, such as maintaining frequency statistics, finding heavy hit- 
ters, and computing quantiles in the sliding-window model. Focusing on 
distributed environments, we demonstrate how ECM-sketches of individ- 
ual, local streams can be composed to generate a (low-error) ECM-sketch 
summary of the order-preserving aggregation of all streams; furthermore, 
we show how ECM-sketches can be exploited for continuous monitoring 
of sliding-window queries over distributed streams. Our extensive experi- 
mental study with two real-life data sets validates our theoretical claims and 
verifies the effectiveness of our techniques. To the best of our knowledge, 
ours is the first work to address efficient, guaranteed-error complex query 
answering over distributed data streams in the sliding-window model. 

1. INTRODUCTION 

The ability to process, in real time, continuous high-volume streams
of data is a common requirement in many emerging application
environments. Examples of such applications include sensor
networks, financial data trackers, and intrusion-detection systems.
As a result, in recent years, we have seen a flurry of activity in 
the area of data-stream processing. Unlike conventional database 
query processing that requires several passes over a static, archived 
data image, data-stream processing algorithms often rely on building
concise, approximate (yet, accurate) sketch synopses of the input
streams in real time (i.e., in one pass over the streaming data).
Such sketch structures typically require small space and update 
time (both significantly sublinear in the size of the data), and can be 
used to provide approximate query answers with guarantees on the 
quality of the approximation. These answers can be more than suf- 
ficient for typical exploratory analysis of massive data, where the 
goal is to detect interesting statistical behavior and patterns rather 
than obtain answers that are precise to the last decimal. Large-scale 
stream processing applications are also inherently distributed, with 
several remote sites observing their local stream(s) and exchanging 
information through a communication network. This distribution 
of the data naturally imposes critical communication-efficiency
requirements that prohibit naive solutions that centralize all the data,
due to its massive volume and/or the high cost of communication 
(e.g., in sensornets). Communication efficiency is particularly im- 
portant for distributed event-monitoring scenarios (e.g., monitoring 
sensor or IP networks), where the goal is real-time tracking of dis- 
tributed measurements and events, rather than one-shot answers to 
sporadic queries [25]. 

Several query models for streaming data have been explored over 
the past decade. Streaming data items naturally carry a notion 
of "time", and, in many applications, it is important to be able 
to downgrade the importance (or, weight) of older items; for in- 
stance, in the statistical analysis of trends or patterns in financial 
data streams, data that is more than a few months old might be 
considered "stale" and irrelevant. Various time-decay models for 
querying streaming data have been proposed in the literature, mostly 
differentiating on the relation of an item's weight to its age (e.g., ex- 
ponential or polynomial decay [6]). The sliding-window model [12] 
is one of the most prominent and intuitive time-decay models that 
considers only a window of the most recent items seen in the stream 
thus far (i.e., items outside the window are "aged out" or given a 
weight of zero). The window itself can be either time-based (i.e., 
items seen in the last N time units) or count-based (i.e., the last 
N items). Several algorithms have been proposed for maintaining 
different types of statistics over sliding-window data streams while 
requiring time and space that is significantly sublinear (typically, 
poly-logarithmic) in the window size N [12, 15, 24, 26]. Still, the
bulk of existing work on the sliding-window model has focused on 
tracking basic counts and other simple aggregates (e.g., sums) over 
one-dimensional streams in a centralized setting. Some recent work
has also considered the case of distributed data; however, no existing
techniques can handle flexible, complex aggregate queries over
rapid, high-dimensional distributed data streams, e.g., with each 
dimension corresponding to the frequency of a distinct key in the 
stream. 

Example: Recent work on effective network-monitoring systems 
(e.g., for detecting DDoS attacks or network-wide anomalies in
large-scale IP networks) has stressed the importance of an efficient
distributed-triggering functionality [20, 22, 18, 17]. In their early 
work, Jain et al. [20] discuss a generic distributed attack-detection 
scheme relying on the ability to maintain frequency statistics for 
high-dimensional data over sliding windows. In particular, each 
node (e.g., a network router implementing Cisco's Netflow proto- 
col, a wireless access point, or a peer in a P2P network) maintains 
a sliding-window count of all observed messages for each target IP 
address. If this count exceeds a pre-determined threshold, which is 
determined based on the capacity of the target machine (possibly 
expressing the fair share of each client to the target machine), an 
event is triggered to a central coordinator as a warning of possible 
overloading. The coordinator then collects network-wide statistics 
to monitor overloaded nodes or abnormal behavior. More recent 
efforts have focused on different variants and extensions of this ba- 
sic scheme, often requiring more extensive data/statistics collection 
and more sophisticated analyses [18, 17]. (Note that such data col- 
lection mechanisms are supported by commercial products, such as 
the Cisco Netflow Collection Engine solution.) 

The ability to efficiently summarize high-dimensional data over 
sliding windows is obviously crucial to such network-monitoring 
schemes, given the tremendous volume of network-data streams 
and their massive domain sizes (e.g., $2^{48}$ for IPv6 addresses). This
raises a critical need for synopsis data structures that can compactly 
capture accurate frequency statistics for a vast domain space over 
sliding windows. Furthermore, to enable the coordinator to aggre- 
gate data coming from different nodes (a requirement for detecting 
DDoS attacks), we need to be able to compose individually con- 
structed synopses to a single synopsis which can capture the global 
state of the network and help isolate network-wide abnormalities. 
Thus, we are faced with the difficult challenge of designing ef- 
fective, composable synopses that can support potentially complex 
sliding-window analysis queries over massive, distributed network- 
data streams. □ 

Note that similar requirements are frequently observed in other 
domains, e.g., for identifying misbehaving nodes in large wireless 
networks, for training of classifiers with distributed training data 
that expires over time, and for ranking products in a cloud-based 
e-shop, based on the number of recent visits of each product. 

Our Contributions. In this paper, we consider the problem of an- 
swering potentially complex queries over distributed, high-dimen- 
sional data streams in the sliding-window model. Our contribu- 
tions can be summarized as follows. 

• ECM-Sketches for Sliding-Window Streams. We introduce a
novel sketch synopsis (termed ECM-sketch) that allows effective 
summarization of streaming data over both time-based and count- 
based sliding windows with probabilistic accuracy guarantees. In 
a nutshell, our ECM-sketch combines the well-known Count-Min 
sketch structure [10] for conventional streams with state-of-the-art 
tools for sliding-window statistics. The end result is a sliding- 
window sketch synopsis that can provide provable, guaranteed- 
error performance for point, as well as inner-product, queries, and 
can be employed to address a broad range of problems, such as 
maintaining frequency statistics, finding heavy hitters, and com- 
puting quantiles in the sliding-window model. 

• Time-based Sliding Windows over Distributed Streams. Fo- 
cusing on distributed environments, we demonstrate how ECM- 
sketches summarizing time-based sliding windows of individual, 
local streams can be composed to generate a guaranteed-error ECM- 
sketch synopsis of the order-preserving aggregation of all streams. 
While conventional Count-Min sketches are trivially composable, 
composing ECM-sketches is more challenging, since it requires
the composition of the sliding-window statistics maintained in the
sketch. Compared to earlier work on composable, randomized sli- 
ding-window statistics [27, 15], our sliding window approximation 
technique is completely deterministic and is much more space ef- 
ficient (with a linear rather than a quadratic dependence on the ap- 
proximation error). This increased efficiency comes at the cost of 
a slight inflation of the worst-case error guarantee due to composi- 
tion. Furthermore, we demonstrate how our ECM-sketches can be 
exploited in the context of the geometric framework of Sharfman et 
al. [25] for continuous monitoring of sliding-window queries over 
distributed streams. 

• Experimental Study and Validation. We perform a thorough 
experimental evaluation of our techniques using two real-life data 
sets, in both centralized and distributed settings. The results of our 
study verify the efficiency and effectiveness of our ECM-sketch 
synopses in a variety of applications, and expose interesting func- 
tional trade-offs. When compared to algorithms based on random- 
ized sliding window synopses - which are the only ones that were 
considered for composition up to now - ECM-sketches reduce the 
memory and computational requirements by at least one order of 
magnitude with a very small loss in accuracy. Similar savings ap- 
ply to the network requirements. 

2. RELATED WORK 

Centralized and Distributed Data Streams. Most prior work on 
data-stream processing has focused on developing space-efficient, 
one-pass algorithms for performing a wide range of centralized, 
one-shot computations on massive data streams; examples include 
computing quantiles [16], estimating distinct values [14], count- 
ing frequent elements (i.e., "heavy hitters") [5, 9], and estimating 
join sizes and stream norms [1, 10]. Out of these efforts, flexi- 
ble, general-purpose sketch summaries, such as the AMS [1] and 
the Count-Min [10] sketch have found wide applicability in a broad 
range of stream-processing scenarios. More recent efforts have also 
concentrated on distributed-stream processing, proposing commu- 
nication-efficient streaming tools for handling a number of query 
tasks, including distributed tracking of simple aggregates [23], quan- 
tiles [8], and join aggregates [7], as well as monitoring distributed 
threshold conditions [25]. All the above-referenced works assume 
a traditional, "full-history" data stream and do not address the is- 
sues specific to the sliding-window model. 

Sliding-Window Stream Queries. As mentioned earlier, the bulk
of existing work on the sliding-window model has focused on al- 
gorithms for maintaining simple statistics, such as basic counts and 
sums, in space and time that is significantly sub-linear (typically, 
poly-logarithmic) in the sliding-window size N. Exponential
histograms [12] are a state-of-the-art deterministic technique for
maintaining $\epsilon$-approximate counts and sums over sliding windows, using
$O(\frac{1}{\epsilon} \log^2 N)$ space. Deterministic waves [15] solve the same
basic counting/summation problem with the same space complexity
as exponential histograms, but improve the worst-case update time
complexity to $O(1)$; on the other hand, randomized waves [15] rely
on randomization through hashing to track duplicate-insensitive 
counts (i.e., count-distinct aggregates) over sliding windows. 
While randomized waves can be easily composed (in distributed 
settings), they also come with an increased space requirement of
$O(\frac{\log(1/\delta)}{\epsilon^2} \log^2 N)$, where $\delta$ is a small probability of failure.
Xu et al. [27] describe a randomized, sampling-based synopsis, very
similar to randomized waves, for tracking sliding-window counts 
and sums with out-of-order arrivals (e.g., due to network delays) 
in a distributed setting. As with randomized waves, their space
requirements are also quadratic in the inverse approximation error;
furthermore, their approach requires knowledge of the maximum
number of elements in any sliding window (to set up the synopsis 
data structure), which could be problematic in dynamic, widely- 
distributed environments. Cormode et al. [11] also propose ran- 
domized techniques for handling out-of-order arrivals for tracking 
duplicate-insensitive sliding-window aggregates. To address the 
high cost associated with randomized data structures, Busch and 
Tirthapura propose a deterministic structure for handling out-of- 
order arrivals in sliding windows [3]. Similar to the other deter- 
ministic structures, this structure also does not allow composition 
and focuses only on basic counts and sums. Finally, Chan et al. [4] 
investigate continuous monitoring of exponential-histogram aggre- 
gates over distributed sliding windows. The main contribution of 
their work lies in the efficient scheduling of the propagation of the 
local exponential-histogram summaries to a coordinator, without 
violating prescribed accuracy guarantees. 

Going beyond counts, sums, and simple aggregates, there is
surprisingly little work on the more general problem of maintaining
general, frequency-distribution synopses over high-dimensional
streaming data in the sliding-window model. Hung and Ting [19] 
and Dimitropoulos et al. [13] propose synopses based on Count- 
Min sketches for tracking heavy hitters and frequency counts over 
sliding windows; still, their techniques rely on keeping simple equi- 
width counters within the sketch, and, thus, cannot provide any 
meaningful error guarantees, especially for small query ranges. Sim- 
ilarly, the hybrid histograms of Qiao et al. [24] combine exponen- 
tial histograms with simplistic equi-width histograms for answer- 
ing sliding-window range queries; again, these structures cannot 
give meaningful bounds on the approximation error and cannot be 
composed in a distributed setting. 

3. PRELIMINARIES 

ECM-sketches combine the functionalities of Count-Min 
sketches [10] and exponential histograms [12]. We now describe 
the two structures, focusing on the aspects related to our work.

Count-Min Sketches. Count-Min sketches are a widely applied 
sketching technique for data streams. A Count-Min sketch is composed
of a set of d hash functions, $h_1(\cdot), h_2(\cdot), \ldots, h_d(\cdot)$, and a
2-dimensional array of counters of width w and depth d. Hash
function $h_j$ corresponds to row j of the array, mapping stream items to
the range of $[1 \ldots w]$. Let $CM[i, j]$ denote the counter at position
(i, j) in the array. To add an item x of value $v_x$ in the Count-Min
sketch, we increase the counters located at $CM[h_j(x), j]$ by $v_x$,
for $j \in [1 \ldots d]$. A point query for an item q is answered by
hashing the item in each of the d rows and getting the minimum value
of the corresponding cells, i.e., $\min_{j=1}^{d} CM[h_j(q), j]$. Note that
hash collisions may cause estimation inaccuracies, but only
overestimations. By setting $d = \lceil \ln(1/\delta) \rceil$ and $w = \lceil e/\epsilon \rceil$, where e is the
base of the natural logarithm, the structure enables point queries to
be answered with an error of less than $\epsilon \|a\|_1$, with a probability of
at least $1 - \delta$, where $\|a\|_1$ denotes the number of items seen in the
stream. Similar results hold for range and inner product queries.
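
The update and point-query procedures of a Count-Min sketch can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [10]: the class name `CountMinSketch` and the salted BLAKE2b hashing in `_bucket` are hypothetical stand-ins for the hash functions $h_j$, and indices are 0-based.

```python
import hashlib
import math


class CountMinSketch:
    """Count-Min sketch with d rows of w counters each."""

    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)       # w = ceil(e / epsilon)
        self.d = math.ceil(math.log(1.0 / delta))  # d = ceil(ln(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, item, row):
        # Assumption: a salted BLAKE2b digest stands in for hash function h_row.
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, item, value=1):
        # Increase one counter per row by the value of the item.
        for row in range(self.d):
            self.table[row][self._bucket(item, row)] += value

    def query(self, item):
        # Collisions can only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][self._bucket(item, row)]
                   for row in range(self.d))
```

With `epsilon = 0.01` and `delta = 0.01`, the sketch uses 5 rows of 272 counters, and `query` never returns less than the true frequency.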

Exponential Histograms. Exponential histograms [12] are a de- 
terministic structure, proposed to address the basic counting prob- 
lem, i.e., for counting the number of true bits in the last N stream 
arrivals. They belong to the family of methods that break the slid- 
ing window range into smaller windows, called buckets or basic 
windows, to enable efficient maintenance of the statistics. Each 
bucket contains the aggregate statistics, i.e., number of arrivals and 
bucket bounds, for the corresponding sub-range. Buckets that no 
longer overlap with the sliding window are expired and discarded 
from the structure. To compute an aggregate over the whole (or
a part of) sliding window, the statistics from all buckets overlapping
with the query range are aggregated. For example, for basic
counting, aggregation is a summation of the number of true bits in
the buckets. A possible estimation error can be introduced due to
the oldest bucket inside the query range, which usually has only a
partial overlap with the query. Therefore, the maximum possible
estimation error is bounded by the size of the last bucket.

Notation | Description
N | Length of the sliding window, in time units or # arrivals
$h_i(\cdot)$ | Hash function i of the Count-Min sketch
$a_r$, $b_r$ | Substream of stream a, b, within the query range r
$f_a(x, r)$ | Frequency of item x in stream a, within the query range r
$E_a(i, j, r)$ | Estimated value of the ECM-sketch counter for stream a in position (i, j) for query range r
$a_r \odot b_r$, $\widehat{a_r \odot b_r}$ | Real and estimated inner product of $a_r$ and $b_r$
$u(N, S)$ | Upper bound of number of arrivals on stream S within the sliding window of length N

Table 1: Frequently used notation.

To reduce the space requirements, exponential histograms main- 
tain buckets of exponentially increasing sizes. Bucket boundaries 
are chosen such that the ratio of the size of each bucket b with the 
sum of the sizes of all buckets more recent than b is upper bounded. 
In particular, the following invariant (invariant 1) is maintained for
all buckets j: $C_j / (2(1 + \sum_{i=1}^{j-1} C_i)) \le \epsilon$, where $\epsilon$ denotes the
maximum acceptable relative error and $C_j$ denotes the size of bucket j
(number of true bits arrived in the bucket range), with bucket 1 
being the most recent bucket. Queries are answered by summing 
the sizes of all buckets that fully overlap the query range, and half 
of the size of the oldest bucket, if it partially overlaps the query. 
The estimation error is solely contained in the oldest bucket, and is 
therefore bounded by this invariant, resulting in a maximum
relative error of $\epsilon$.
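
The bucket maintenance and query procedure described above can be illustrated with the following Python sketch for time-based windows. It is a simplified illustration rather than the exact algorithm of [12]: the class name and method names are hypothetical, and the merge threshold (at most k + 1 buckets per size, with k = ceil(1/epsilon)) is a conservative assumption chosen so that invariant 1 holds, not necessarily the constant used in the original structure.

```python
import math
from collections import deque


class ExponentialHistogram:
    """Approximate count of arrivals in a time-based sliding window."""

    def __init__(self, epsilon, window):
        self.window = window
        k = math.ceil(1.0 / epsilon)
        # Assumption: a cap of k + 1 buckets per size keeps at least k buckets
        # of every smaller size, which is enough to satisfy invariant 1.
        self.max_same_size = k + 1
        self.buckets = deque()  # (newest arrival time, size), most recent first

    def _expire(self, now):
        # Drop buckets that no longer overlap the window (now - window, now].
        while self.buckets and self.buckets[-1][0] <= now - self.window:
            self.buckets.pop()

    def add(self, now):
        self._expire(now)
        self.buckets.appendleft((now, 1))
        start = 0
        while start < len(self.buckets):
            size = self.buckets[start][1]
            end = start
            while end < len(self.buckets) and self.buckets[end][1] == size:
                end += 1
            if end - start <= self.max_same_size:
                break
            # Too many buckets of this size: merge the two oldest ones.
            merged_time = self.buckets[end - 2][0]
            del self.buckets[end - 1]
            self.buckets[end - 2] = (merged_time, 2 * size)
            start = end - 2  # the merged bucket may overflow the next size

    def estimate(self, now, query_range=None):
        query_range = self.window if query_range is None else query_range
        self._expire(now)
        sizes = [size for time, size in self.buckets if time > now - query_range]
        if not sizes:
            return 0.0
        oldest = sizes[-1]
        newer = sum(sizes) - oldest
        # A bucket of size 1 is counted exactly; larger ones contribute half.
        return newer + (oldest if oldest == 1 else oldest / 2)
```

For instance, with `epsilon = 0.1` and one arrival per time unit, a query over a window of 100 time units returns a value within 10 of the true count while storing only a few dozen buckets.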

4. ECM-SKETCHES 

We now describe ECM-sketches (short for Exponential Count- 
Min sketches), a composable sketch for maintaining data stream 
statistics over sliding windows in distributed environments. ECM- 
sketches combine the functionality of Count-Min sketches and slid- 
ing windows, and support both time-based and count-based sliding 
windows under the cash register model. Therefore, they can be 
used for compactly summarizing high-dimensional streams over 
sliding windows, i.e., to maintain the observed frequencies of the 
stream items within the sliding window range. 

The core of the structure is a modified Count-Min sketch. Count- 
Min sketches alone cannot handle the sliding window requirement. 
To address this limitation, ECM-sketches replace the Count-Min 
counters with sliding window structures. Each counter is main- 
tained as a sliding window, covering the last N time units, or the 
last N arrivals, depending on whether we need time-based or count- 
based sliding windows. 

As discussed in Section 2, there have been several algorithms 
proposed for sliding window maintenance. Due to the large ex- 
pected number of sliding window counters in ECM-sketches, we 
require an algorithm with a small memory footprint. Randomized 
sliding window synopses are therefore not a good choice. Instead, 
we employ exponential histograms [12], a compact and efficient 
deterministic synopsis. Each of the Count-Min counters is
implemented as an exponential histogram, configured to provide an
$\epsilon$-approximation for any query within a sliding window of length N,
i.e., the estimation $\hat{x}$ of the counter for any query range within the
sliding window length is in the range of $(1 \pm \epsilon)x$ of the true value
x of the counter. We will be discussing our choice for exponential
histograms again in more detail in the following section, where we 
will consider alternative deterministic and randomized algorithms.

Figure 1: Adding an element to the ECM-sketch.

Adding an item x to the structure is similar to the case of the 
standard Count-Min sketches. The process for time-based sliding 
windows is depicted in Figure 1. First, the counters $CM[h_j(x), j]$,
where $j \in \{1 \ldots d\}$, corresponding to the d hash functions are
located. For each of the counters, we register the arrival of the item
at time t, and remove all expired information, i.e., the buckets of the 
exponential histogram that have no overlap with the sliding window 
range. The process for count-based sliding windows is similar, but 
instead of registering each arrival with system time t, we register it 
with the count of arrivals since the beginning of the stream. 

The challenges that need to be addressed for the integration of 
exponential histograms with Count-Min sketches are: (a) to take 
into account the additional error introduced by the sliding window 
counters for deriving the accuracy guarantees for ECM-sketches 
(presented in the remainder of this section), and, (b) to enable com- 
position of a set of ECM-sketches to a single ECM-sketch repre- 
senting the order-preserving aggregation of the corresponding indi- 
vidual streams (Section 5). 

4.1 Query Answering 

We now explain how ECM-sketches support point queries, inner 
product queries, and self-join queries, and we derive probabilistic 
guarantees for the accuracy of the estimation. Our analysis covers 
both sliding window models, i.e., time-based and count-based. 

Point Queries. A point query (x, r) is a combination of an item 
identifier x, and the query range r defined either as number of time 
units or number of arrivals. Point queries are executed as follows. 
The query item is hashed to the d counters $CM[h_j(x), j]$, where
$j \in \{1 \ldots d\}$, and the estimate of each counter $E(h_j(x), j, r)$ for
the query range is computed. The estimate value for the frequency
of x is $\hat{f}(x, r) = \min_{j=1 \ldots d} E(h_j(x), j, r)$.
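
The insertion and point-query procedures for time-based windows can be sketched as follows. To keep the example short and self-contained, each cell uses a hypothetical `ExactWindowCounter` that stores every arrival time; this is only a stand-in for the exponential histogram used by ECM-sketches, and any counter exposing the same `add` and `estimate` methods could be plugged in. The class names and the BLAKE2b-based hashing are assumptions made for illustration.

```python
import hashlib
import math
from collections import deque


class ExactWindowCounter:
    """Stand-in for an exponential histogram: keeps every arrival time."""

    def __init__(self, window):
        self.window = window
        self.times = deque()

    def add(self, now):
        self.times.append(now)
        # Remove expired arrivals, i.e., those outside (now - window, now].
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()

    def estimate(self, now, query_range):
        return sum(1 for t in self.times if t > now - query_range)


class ECMSketch:
    """Count-Min array whose cells are sliding-window counters."""

    def __init__(self, eps_cm, delta_cm, window, counter=ExactWindowCounter):
        self.w = math.ceil(math.e / eps_cm)
        self.d = math.ceil(math.log(1.0 / delta_cm))
        self.cells = [[counter(window) for _ in range(self.w)]
                      for _ in range(self.d)]

    def _bucket(self, item, row):
        # Assumption: a salted BLAKE2b digest stands in for hash function h_row.
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.w

    def add(self, item, now):
        # Register the arrival at time `now` in one counter per row.
        for j in range(self.d):
            self.cells[j][self._bucket(item, j)].add(now)

    def point_query(self, item, now, query_range):
        # Minimum over the d counter estimates for the query range.
        return min(self.cells[j][self._bucket(item, j)].estimate(now, query_range)
                   for j in range(self.d))
```

For count-based windows, the same structure applies if `now` is replaced by the running count of arrivals since the beginning of the stream.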

Let $\delta_{cm}$ and $\epsilon_{cm}$ denote the configuration parameters of the
Count-Min sketch, whereas $\epsilon_{sw}$ denotes the configuration parameter of
the exponential histogram. With $\|a_r\|_1$ we denote the number of
arrivals within the query range. The following theorem provides
probabilistic guarantees for the approximation quality. 

THEOREM 1. $|\hat{f}(x, r) - f(x, r)| > (\epsilon_{sw} + \epsilon_{cm} + \epsilon_{sw}\epsilon_{cm}) \|a_r\|_1$
with probability at most $\delta = \delta_{cm}$.

PROOF. Special case of Theorem 3, proved in the appendix. □ 
As is typical for small-space sketches, the error guarantees are
relative to the stream characteristics, i.e., the L1 norm. For all pairs of
$\epsilon_{sw}$ and $\epsilon_{cm}$ satisfying $\epsilon_{sw} + \epsilon_{cm} + \epsilon_{sw}\epsilon_{cm} = \epsilon$, the maximum
estimation error will be $\epsilon \|a_r\|_1$. (Note that $\epsilon \approx \epsilon_{cm} + \epsilon_{sw}$, since
typically $\epsilon_{sw}, \epsilon_{cm} < 0.5$, and thus the product $\epsilon_{sw}\epsilon_{cm}$ is much smaller
than the two linear terms.) The optimal pair of $\epsilon_{cm}$ and $\epsilon_{sw}$ is the
one that minimizes memory utilization. The worst-case memory
requirements of the structure are minimized as follows. The required
memory per sliding window counter is $O(\frac{1}{\epsilon_{sw}} \log^2 Z)$, where Z
denotes the maximum possible count of each item in the sliding
window. Therefore, the maximum required memory is
$mem = \frac{c}{\epsilon_{sw}} \log^2 Z \times w \times d$, with c denoting a constant,
$w = \lceil e/\epsilon_{cm} \rceil$, and $d = \lceil \ln(1/\delta_{cm}) \rceil$. By derivation we find that the memory
bound is minimized for $\epsilon_{sw} = \epsilon_{cm} = \sqrt{\epsilon + 1} - 1$, and becomes
$O(\frac{\ln^2 Z \ln(1/\delta_{cm})}{(\sqrt{\epsilon + 1} - 1)^2}) = O(\frac{\ln^2 Z \ln(1/\delta_{cm})}{\epsilon^2})$.
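
The optimal split of the error budget can be verified with a short calculation. The following is a sketch that ignores the ceilings in w and d and treats c, Z, and $\delta_{cm}$ as fixed, so that the memory is proportional to $1/(\epsilon_{sw}\epsilon_{cm})$:

```latex
\begin{align*}
\epsilon_{sw} + \epsilon_{cm} + \epsilon_{sw}\epsilon_{cm} = \epsilon
  \;&\Longleftrightarrow\; (1+\epsilon_{sw})(1+\epsilon_{cm}) = 1+\epsilon, \\
\epsilon_{sw}\,\epsilon_{cm} &= \epsilon - (\epsilon_{sw} + \epsilon_{cm}), \\
(1+\epsilon_{sw}) + (1+\epsilon_{cm}) &\ge 2\sqrt{(1+\epsilon_{sw})(1+\epsilon_{cm})}
  = 2\sqrt{1+\epsilon}
  \quad \text{(equality iff } \epsilon_{sw} = \epsilon_{cm}\text{)}, \\
\epsilon_{sw} = \epsilon_{cm} &= \sqrt{1+\epsilon} - 1 .
\end{align*}
```

Minimizing the sum $\epsilon_{sw} + \epsilon_{cm}$ maximizes the product $\epsilon_{sw}\epsilon_{cm}$ and therefore minimizes the memory bound. For example, $\epsilon = 0.1$ gives $\epsilon_{sw} = \epsilon_{cm} \approx 0.0488$.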

Inner Product and Self-Join Queries. Another frequent query
type is the cardinality of the inner product. Given two streams a
and b, the inner product is defined as $a \odot b = \sum_{x \in D} f_a(x) \times f_b(x)$,
where D denotes the input domain, i.e., the distinct input
elements, and $f_a(x)$ (resp. $f_b(x)$) denotes the frequency of element
x in stream a (resp. stream b). Self-join queries, also called the
second frequency moment $F_2$, are a special case of inner product
queries defined over a single stream: $F_2(a) = \sum_{x \in D} (f_a(x))^2$.
Both inner product queries and self-join queries are very important 
for databases, e.g., for building query execution plans, and they 
can be efficiently and accurately computed for streams with the 
cash register and turnstile model. However, similar to point queries, 
computing these queries over sliding windows is challenging. 

ECM-sketches can be used to address this type of queries as well. 
Let $a_r$ (resp. $b_r$) denote the substream of stream a (resp. b) within
the query range. With $CM_a$ we denote the corresponding
ECM-sketch for stream $a_r$, and with $E_a(i, j, r)$ we denote the estimated
value of the counter of $CM_a$ in position (i, j), for query range r.
Also, $f_a(x, r)$ and $\hat{f}_a(x, r)$ denote the real and estimated frequency
of x in stream $a_r$.

The inner product of two streams a and b in a range r is defined
as $a_r \odot b_r = \sum_{x \in D} f_a(x, r) f_b(x, r)$. Using the ECM-sketches
of a and b, we estimate it as follows: $\widehat{a_r \odot b_r} = \min_j (\widehat{a_r \odot b_r})_j$,
where $(\widehat{a_r \odot b_r})_j = \sum_{i=1}^{w} E_a(i, j, r) \times E_b(i, j, r)$. The following
theorem bounds the approximation error of this estimation.



THEOREM 2. $|\widehat{a_r \odot b_r} - a_r \odot b_r| > (\epsilon_{sw}^2 + 2\epsilon_{sw} + \epsilon_{cm}(1 + \epsilon_{sw})^2) \|a_r\|_1 \|b_r\|_1$
with probability at most $\delta_{cm}$.

PROOF. In the appendix. □
The error is therefore $\approx (2\epsilon_{sw} + \epsilon_{cm}) \|a_r\|_1 \|b_r\|_1$, since the
higher-order components are dominated by $\epsilon_{sw}$ and $\epsilon_{cm}$. Similar to the
analysis for point queries, we can find the optimal pair of $\epsilon_{sw}$
and $\epsilon_{cm}$ guaranteeing a maximum error of $\epsilon \|a_r\|_1 \|b_r\|_1$ by
using derivation on the total memory requirements:

$\epsilon_{sw} = -1 - \frac{3 + 3\epsilon}{3^{4/3} K^{1/3}} + \frac{K^{1/3}}{3^{2/3}}$, with $K = 9 + 9\epsilon + \sqrt{3} \sqrt{28 + 57\epsilon + 30\epsilon^2 + \epsilon^3}$,

and $\epsilon_{cm} = \frac{\epsilon - 2\epsilon_{sw} - \epsilon_{sw}^2}{(1 + \epsilon_{sw})^2}$.
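
The inner-product estimator of this section (a row-wise sum of products of aligned counters, followed by a minimum over the rows) can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, both counter arrays are assumed to share the same dimensions and hash functions, and exact in-range frequencies $f(x, r)$ are used in place of the sliding-window counter estimates $E(i, j, r)$.

```python
import hashlib


def bucket(item, row, width):
    # Assumption: a salted BLAKE2b digest stands in for hash function h_row.
    digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                             salt=row.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % width


def counter_matrix(freqs, depth, width):
    """Build a depth x width counter array from in-range frequencies f(x, r)."""
    matrix = [[0] * width for _ in range(depth)]
    for item, freq in freqs.items():
        for j in range(depth):
            matrix[j][bucket(item, j, width)] += freq
    return matrix


def inner_product_estimate(est_a, est_b):
    """min over rows j of sum_i E_a(i, j, r) * E_b(i, j, r)."""
    if len(est_a) != len(est_b):
        raise ValueError("sketches must have the same depth")
    row_products = []
    for row_a, row_b in zip(est_a, est_b):
        if len(row_a) != len(row_b):
            raise ValueError("sketches must have the same width")
        row_products.append(sum(x * y for x, y in zip(row_a, row_b)))
    return min(row_products)
```

With exact counters the estimate never falls below the true inner product, because hash collisions can only add non-negative cross terms; with approximate sliding-window counters the additional $\epsilon_{sw}$ terms of Theorem 2 apply.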

4.2 Extensions 

4.2.1 Time-based Vs Count-based ECM-Sketches 

Exponential histograms were originally developed for count-ba- 
sed sliding windows. They can be easily extended for time-based 
sliding windows as follows. First, each entry in the data structure is 
identified using its arrival time, instead of using its position in the 
stream. To reduce memory, arrival times are stored in wraparound 
counters of $O(\log N)$ bits, where N is the length of the sliding
window, e.g., in milliseconds. Second, entries expire based on their 
arrival time, and not on their position in the stream. Finally, we re- 
quire an upper bound of the number of arrivals within the sliding 
window time range for each stream S, denoted as u(N, S). Note 
that this is required only for computing the maximum memory re- 
quirements of the structure a priori; it does not have an impact on 
the actual required memory or quality of ECM-sketches. Further- 
more, the bound can be very loose without a noticeable change 
on the estimated space requirements, because space complexity in- 
creases only logarithmically with u(N, S). 

Complexity. We use N to denote the length of the sliding window, 
either in number of arrivals or in time, depending on the desired 
sliding window model. With $u(N, S)$ we denote the upper bound
of the number of arrivals in stream S within a sliding window of
length N. Also, $g(N, S) = \max(u(N, S), N)$.

To get an $\epsilon_{sw}$-approximation of the number of one-bits in the
sliding window, exponential histograms require $O(\log N + \log\log(u(N, S)))$
memory per bucket, to store the bucket size and
bucket boundaries. The number of buckets is $O(\log(u(N, S))/\epsilon_{sw})$,
yielding a total memory of $O(\log^2(g(N, S))/\epsilon_{sw})$. With respect to
computational cost, the update cost per element is $O(\log(u(N, S)))$
worst-case, and $O(1)$ amortized time. Queries covering the whole
sliding window are executed in constant time. For queries with
range $N' < N$, the required time is $O(\log(u(N, S))/\epsilon_{sw})$. The
extra time is required for finding the oldest bucket overlapping with
the query, assuming sequential access. If the storage model of the
buckets supports random access, e.g., a fixed-length array, then this
time can be further reduced to $O(\log(\log(u(N, S))/\epsilon_{sw}))$, by
employing binary search.

 | Exponential Histogram | Deterministic Wave | Randomized Wave
Memory | $O(\frac{1}{\epsilon^2} \ln(\frac{1}{\delta}) \ln^2(g(N, S)))$ | $O(\frac{1}{\epsilon^2} \ln(\frac{1}{\delta}) \ln^2(g(N, S)))$ | $O(\frac{1}{\epsilon^3} \ln^2(\delta) \ln^2(u(N, S)))$
Amort. update | $O(\ln(1/\delta))$ | $O(\ln(1/\delta))$ | $O(\ln^2(\delta))$
Worst update | $O(\ln(1/\delta) \ln(u(N, S)))$ | $O(\ln(1/\delta))$ | $O(\ln^2(\delta) \ln(u(N, S)))$
Query | $O(\ln(1/\delta) \ln(u(N, S))/\sqrt{\epsilon})$ | $O(\ln(1/\delta) \ln(u(N, S))/\sqrt{\epsilon})$ | $O(\ln^2(\delta)(\ln(u(N, S)) + 1/\epsilon^2))$

Table 2: Computational and space complexity of ECM-sketches. Function g(N, S) is used as a shortcut for max(u(N, S), N).

The space complexity of ECM-sketches is as follows. For the 
Count-Min array, we require an array of width $w = \lceil e/\epsilon_{cm} \rceil$ and
depth $d = \lceil \ln(1/\delta) \rceil$. Each cell in the array stores an exponential
histogram, requiring $O(\log^2(g(N, S))/\epsilon_{sw})$ bits. Therefore, the
total memory requirements are $O(\frac{1}{\epsilon_{sw}\epsilon_{cm}} \log^2(g(N, S)) \log(1/\delta))$.
With respect to the time complexity, adding an element requires
computing d hash functions, and updating d separate exponential
histograms. The amortized complexity for each arrival is
therefore $O(d) = O(\log(1/\delta))$, whereas the worst-case complexity is
$O(d \log(u(N, S))) = O(\log(u(N, S)) \log(1/\delta))$. Finally, query
execution takes $O(\log(1/\delta))$ time for a query of range N' equal to
N. For N' < N, the execution cost is $O(d \log(u(N, S))/\epsilon_{sw}) =
O(\log(1/\delta) \log(u(N, S))/\sqrt{\epsilon})$ with sequential access to buckets,
e.g., using a linked list. With random access support, binary search
can be used for finding the last relevant bucket for each query,
reducing the query cost to $O(\log(1/\delta) \log(\log(u(N, S))/\sqrt{\epsilon}))$.
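
The binary-search lookup of the oldest bucket overlapping a query range can be sketched as follows, assuming random access to the buckets. The array layout is an assumption made for this illustration and is not prescribed by the text: `timestamps` holds the newest arrival time of each bucket in ascending order (oldest bucket first), `sizes` holds the bucket sizes, and `suffix_sizes[i]` is a precomputed `sum(sizes[i:])` so that no linear scan is needed once the bucket is found.

```python
from bisect import bisect_right


def range_estimate(timestamps, sizes, suffix_sizes, now, query_range):
    """Estimate the count of arrivals in (now - query_range, now]."""
    # Index of the first (oldest) bucket whose newest arrival is in range.
    first = bisect_right(timestamps, now - query_range)
    if first == len(timestamps):
        return 0.0
    oldest = sizes[first]
    newer = suffix_sizes[first] - oldest
    # Buckets newer than the oldest overlapping one count fully; the oldest
    # contributes half of its size, unless it holds a single arrival.
    return newer + (oldest if oldest == 1 else oldest / 2)


# Hypothetical example: four buckets, oldest first.
timestamps = [10, 20, 30, 40]
sizes = [4, 2, 1, 1]
suffix_sizes = [8, 4, 2, 1]
print(range_estimate(timestamps, sizes, suffix_sizes, now=40, query_range=25))
```

In an ECM-sketch this lookup would be repeated once per row, i.e., d times per query.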

4.2.2 ECM-Sketches based on Waves 

The sliding window counters can also be materialized using other 
sliding window algorithms. In the literature, two such algorithms 
are particularly well-known: (a) deterministic waves, and, (b) ran- 
domized waves [15]. We now show how ECM-sketches can in- 
corporate these algorithms, and discuss the positive and negative 
aspects of each variant. 

Deterministic Waves. Deterministic waves [15] have the same
memory requirements as exponential histograms, and they
outperform exponential histograms with respect to worst-case
complexity for updates, always requiring constant time. As such, the
space and computational complexity of ECM-sketches based on
deterministic waves is the same as that of sketches based on
exponential histograms, with the only difference being the
worst-case update complexity, which is $O(\log(1/\delta))$.

A downside of deterministic waves is that they require knowl- 
edge of the upper bound of the number of arrivals u(N, S) during 
the initialization of the data structures, to decide on the required 
number of queues/levels. Any overestimation of $u(N, S)$ therefore
translates to an increase in the space requirements, which grow
logarithmically with $u(N, S)$. It is important to note that this constraint is
substantially less limiting compared to the constraints of previous 
algorithms, e.g., [27], which required an upper bound for the total 
number of items in all streams, and therefore could not be applied 
to dynamic networks, with an unknown number of participating 
nodes and streams. 



Randomized Waves. Randomized waves [15] provide an $(\epsilon, \delta)$
approximation for the basic counting problem, i.e.,
$\Pr[|\hat{x} - x| \le \epsilon_{sw} x] \ge 1 - \delta_{sw}$, where $\hat{x}$ and $x$ denote the estimated and real
number of true bits in the sliding window range respectively. This
structure has substantially higher space complexity compared to the
deterministic counterparts - $O(1/\epsilon_{sw}^2)$ instead of $O(1/\epsilon_{sw})$. However,
randomized waves are important for distributed applications,
as they enable lossless aggregation of individual summaries to a 
single summary corresponding to the aggregated data. Therefore, 
we also consider randomized waves for integration with the ECM- 
sketch. 

The space complexity of ECM-sketches based on randomized
waves is derived by multiplying the space complexity of the two basic
structures: $O(\log(\delta_{cm}) \log(\delta_{sw}) \log^2(f(N, S))/(\epsilon_{cm}\epsilon_{sw}^2))$.
Inserting a new element requires $O(\log(\delta_{cm}) \log(\delta_{sw}))$ amortized
time, and $O(\log(\delta_{cm}) \log(\delta_{sw}) \log(f(N, S)))$ worst-case time. Finally,
query execution takes $O(\log(\delta_{cm}) \log(\delta_{sw}) (\log(f(N, S)) + 1/\epsilon_{sw}^2))$
time with sequential access to buckets, and
$O(\log(\delta_{cm}) \log(\delta_{sw}) (\log\log(f(N, S)) + \log(1/\epsilon_{sw}^2)))$ time with random access.

THEOREM 3. $|\hat{f}(x, r) - f(x, r)| > (\epsilon_{sw} + \epsilon_{cm} + \epsilon_{sw}\epsilon_{cm}) \|a_r\|_1$
with probability at most $\delta = \delta_{sw} + \delta_{cm}$.

PROOF. In the appendix. □ 
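By derivation on the total memory usage, we can find the combination
of $\epsilon_{sw}$ and $\epsilon_{cm}$ that minimizes the memory bound:
$\epsilon_{sw} = \frac{\sqrt{\epsilon^2 + 10\epsilon + 9} + \epsilon - 3}{4}$ and
$\epsilon_{cm} = \frac{3\epsilon - \sqrt{\epsilon^2 + 10\epsilon + 9} + 3}{\epsilon + \sqrt{\epsilon^2 + 10\epsilon + 9} + 1}$.
For this combination, the space
complexity becomes $O(\log(\delta_{cm}) \log(\delta_{sw}) \log^2(f(N, S))/\epsilon^2)$, and
for $\delta_{cm} = \delta_{sw} = \delta/2$ it becomes $O(\log^2(\delta) \log^2(f(N, S))/\epsilon^2)$.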
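
The closed-form split can be checked numerically. The sketch below is an illustration under two assumptions that are read off the surrounding text rather than stated as code by the authors: the total error is $\epsilon = \epsilon_{sw} + \epsilon_{cm} + \epsilon_{sw}\epsilon_{cm}$, and the memory bound is proportional to $1/(\epsilon_{cm}\epsilon_{sw}^2)$. The function names are hypothetical.

```python
import math


def optimal_split(eps: float) -> tuple[float, float]:
    """Return (eps_sw, eps_cm) from the closed form, for a total error budget eps."""
    root = math.sqrt(eps * eps + 10 * eps + 9)
    eps_sw = (root + eps - 3) / 4
    eps_cm = (3 * eps - root + 3) / (eps + root + 1)
    return eps_sw, eps_cm


def memory_factor(eps: float, eps_sw: float) -> float:
    """Assumed cost 1 / (eps_cm * eps_sw^2), with eps_cm implied by the error budget."""
    eps_cm = (eps - eps_sw) / (1 + eps_sw)
    return 1.0 / (eps_cm * eps_sw ** 2)


eps = 0.1
eps_sw, eps_cm = optimal_split(eps)
print(eps_sw, eps_cm, eps_sw + eps_cm + eps_sw * eps_cm)
```

Perturbing `eps_sw` slightly in either direction and recomputing `memory_factor` gives a quick sanity check that the closed form sits at the minimum of the assumed cost.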

Table 2 summarizes the main results for the combination of ECM- 
sketches and the three sliding window structures. The results cor- 
respond to both time-based and count-based sliding windows. 

5. ORDER-PRESERVING AGGREGATION 

For many distributed applications, such as the network monitoring 
application described in the introduction, we require aggregating 
individual ECM-sketches $CM_1, CM_2, \ldots, CM_n$, each one
corresponding to a stream $S_1, S_2, \ldots, S_n$, to get a single ECM-sketch
$CM_\oplus$ that corresponds to the logical stream $S_\oplus = S_1 \oplus S_2 \oplus \cdots \oplus S_n$.
The $\oplus$ operator is defined as an aggregation that preserves
the ordering and arrival time of the events. Standard Count-Min
sketches allow this aggregation, as long as all sketches are
constructed with identical dimensions and hash functions. For this, 
they rely on the linearity of the Count-Min counters, which are sim- 
ple integers in the general case. However, this does not trivially 
hold for ECM-sketches, where the counters are not simple num- 
bers but complex sliding window structures, since the analysis of 
exponential histograms (as well as all other deterministic sliding 
window structures), does not cover linearity. Although random- 
ized structures cover linearity by default, these are substantially 
more expensive, and not preferable for ECM-sketches. Therefore, 
we now consider the order-preserving aggregation of deterministic 
sliding window structures. Note that this problem is interesting by 
itself, since these data structures are widely used in the literature 
for maintaining statistics over sliding windows. We then extend 
our results to cover aggregation of the ECM-sketches.

5.1 Aggregation of Exponential Histograms

Consider a set of exponential histograms $EH_1, EH_2, \ldots, EH_n$,
summarizing time-based sliding windows. All are configured to
cover a sliding window of $N$ time units. The aggregation operation
is denoted with $\oplus$, i.e., $EH_\oplus = EH_1 \oplus EH_2 \oplus \ldots \oplus EH_n$.
With $EH_i^j$ we denote bucket $j$ of $EH_i$, and $|EH_i^j|$ denotes the
bucket size (number of true bits). By convention, buckets are numbered
such that bucket 1 is the most recent. The ending time of the
bucket is denoted as $e(EH_i^j)$. To ease exposition, we use $s(EH_i^j)$
to denote the starting time of the bucket, even though this is not
explicitly stored in the buckets. By construction, the starting time
of a bucket is equal to the ending time of the previous (older) bucket, i.e.,
$s(EH_i^j) = e(EH_i^{j+1})$.

To construct $EH_\oplus$, our methodology considers the individual
exponential histograms as logs. The general idea is to reconstruct
$EH_\oplus$ by assuming that half of the elements arrive at the starting
time of each bucket, and the other half at the ending time of
the bucket. Precisely, let $B$ denote the list containing all buckets
of all sliding windows. We initialize an empty time-based exponential
histogram with error $\epsilon'$, configured to keep the last $N$ time
units, and a maximum of $\sum_{i=1}^{n} |EH_i|$ elements. For each bucket
$B[i] \in B$, we simulate the insertion in $EH_\oplus$ of $|B[i]|$ true bits. Half
of the bits are inserted with timestamp $s(B[i])$, and the other half with
timestamp $e(B[i])$. The insertions are simulated in the order defined by
the starting and ending timestamps of the buckets.
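
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ExpHistogram` is a simplified time-based exponential histogram written for this example, and the start time of the oldest bucket (which has no older neighbour) is assumed to be the window start.

```python
import math


class ExpHistogram:
    """Simplified time-based exponential histogram counting true bits."""

    def __init__(self, eps, window):
        self.window = window
        k = math.ceil(1 / eps)
        self.max_per_size = math.ceil(k / 2) + 1  # merge once this is exceeded
        self.buckets = []  # newest first; each bucket is [end_time, size]

    def expire(self, now):
        # Drop buckets whose end time has left the window.
        while self.buckets and self.buckets[-1][0] <= now - self.window:
            self.buckets.pop()

    def add(self, t):
        self.expire(t)
        self.buckets.insert(0, [t, 1])
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size; keep the newer end time.
            self.buckets[j - 2][1] = 2 * size
            del self.buckets[j - 1]
            i = j - 2

    def query(self, now, rng):
        # Sum all overlapping buckets, discounting half of the oldest one.
        live = [size for end, size in self.buckets if end > now - rng]
        return sum(live) - live[-1] // 2 if live else 0


def aggregate(histograms, eps_prime, window, now):
    """Replay half of each bucket at its start time and the rest at its end time."""
    events = []
    for eh in histograms:
        eh.expire(now)
        for idx, (end, size) in enumerate(eh.buckets):
            older = idx + 1
            # Assumption: the oldest bucket starts at the window boundary.
            start = eh.buckets[older][0] if older < len(eh.buckets) else now - window
            half = size // 2
            events.append((start, half))
            events.append((end, size - half))
    merged = ExpHistogram(eps_prime, window)
    for t, count in sorted(events):  # order-preserving replay
        for _ in range(count):
            merged.add(t)
    return merged
```

For two short streams whose buckets all have size one, the replay is exact; for larger buckets the split into two halves is what introduces the additional error analysed next.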

THEOREM 4. Consider $n$ time-based exponential histograms
$EH_1, EH_2, \ldots, EH_n$, initialized with error parameter $\epsilon$, and covering
the same time range. The exponential histogram $EH_\oplus$ initialized
with error parameter $\epsilon'$, and constructed with the proposed
aggregation algorithm answers any query within its time range for
the stream $S_\oplus$ with a maximum relative error of $(\epsilon + \epsilon' + \epsilon\epsilon')$.

We will now give the intuition of the proof. The formal proof
is presented in the appendix. Each exponential histogram $EH$ of
stream $S$ configured with error parameter $\epsilon$ can be used to reconstruct
an approximate stream $S'$, as follows: For each bucket $b$ in
$EH$, add $|b|/2$ true bits at time $s(b)$, and $|b|/2$ true bits at time
$e(b)$. We argue that answering any query with starting time $s_q$
within the range of $EH$ using the reconstructed stream $S'$ will
result in a maximum relative error $\epsilon$. Let $b_j$ be the bucket s.t.
$s(b_j) < s_q \le e(b_j)$. Therefore, the accurate answer $x$ of the
query for stream $S$ is bounded by
$\sum_{i=1}^{j-1}|b_i| + 1 \le x \le \sum_{i=1}^{j-1}|b_i| + |b_j|$.
By construction, the reconstructed stream will
contain a total of $\sum_{i=1}^{j-1}|b_i| + |b_j|/2$ items with timestamp greater
than or equal to $s_q$. Therefore, answering the query by counting
the number of true bits in the reconstructed stream with timestamp
after $s_q$ will have a maximum error of
$\max\left(\sum_{i=1}^{j-1}|b_i| + |b_j| - \left(\sum_{i=1}^{j-1}|b_i| + |b_j|/2\right),\ \sum_{i=1}^{j-1}|b_i| + |b_j|/2 - \left(\sum_{i=1}^{j-1}|b_i| + 1\right)\right) = |b_j|/2$.
By invariant 1 of exponential histograms,
$|b_j|/2 \le \epsilon\left(1 + \sum_{i=1}^{j-1}|b_i|\right) \le \epsilon x$. Therefore,
the maximum difference between the answer estimated by stream
$S'$ and the correct answer $x$ will be less than or equal to $\epsilon x$.

Our aggregation algorithm is equivalent to reconstructing each
stream $S'_i$ from exponential histogram $EH_i$, and using these to
recreate an exponential histogram $EH_\oplus$. The reconstruction of
stream $S'$ introduces a maximum relative error $\epsilon$, as explained above.
Summarizing $S'$ with a new exponential histogram, we get an
additional error $\epsilon'$. However, $\epsilon'$ is relative to the answer provided
by stream $S'$, and not by $S$. Therefore, the absolute error due
to the exponential histogram summarization will be $\epsilon' x'$, where
$x' \in (1 \pm \epsilon)x$ and $x$ denotes the accurate answer on $S_i$. Summing
both errors, we get a total relative error of $\epsilon + \epsilon' + \epsilon\epsilon'$.

For the special case when $\epsilon = \epsilon'$, the maximum relative error
becomes $2\epsilon + \epsilon^2$. Concerning space and computational complexity,
$EH_\oplus$ behaves as a standard exponential histogram, and therefore
has the same complexity as presented in [12]. □
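
The summation of the two error terms can be written out explicitly. In the derivation below, $x''$ is notation introduced here (it does not appear in the original text) for the answer returned by the aggregated histogram, $x'$ is the answer on the reconstructed stream, and $x$ is the accurate answer.

```latex
\begin{align*}
|x'' - x| &\le |x'' - x'| + |x' - x| \\
          &\le \epsilon' x' + \epsilon x \\
          &\le \epsilon' (1 + \epsilon) x + \epsilon x \\
          &= (\epsilon + \epsilon' + \epsilon \epsilon')\, x .
\end{align*}
```

The third line uses $x' \le (1 + \epsilon)x$, which is the worst case of $x' \in (1 \pm \epsilon)x$.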

Multi-level Aggregation. It is frequently desired to aggregate sliding
windows over more than one level. For example, consider a
hierarchical P2P network, where each peer maintains its own ex- 
ponential histogram, and pushes it to its parent for aggregation at 
regular intervals. Since the aggregated exponential histograms have 
the same properties as the individual exponential histograms (albeit 
with a higher e), the above analysis also supports iterative aggrega- 
tion of exponential histograms. 

There are two types of approximation error that influence the
estimation of an aggregated exponential histogram. A possible
approximation error, denoted as $err_1$, is introduced due to halving of
the size of the last bucket of the aggregated exponential histogram.
This error occurs only at query time, and is independent of the number
of performed aggregations. Therefore, in a multi-level aggregation
scenario this error does not need to be propagated to the
intermediary exponential histograms. A second type of error, termed
$err_2$, occurs due to the inclusion (exclusion) of data that arrived
before (after) the query starting time in buckets that are accounted
(not accounted) for in the query result.

It turns out that the error $err_2$ is additive in the worst case (in
absolute value). For instance, in the lowest level (Level 0) of the
hierarchy, aggregating two exponential histograms (all with relative
error $\epsilon$), having a true number of bits (in a given query range) equal
to $i_1$ and $i_2$, will result in a maximum value of $err_2 \le \epsilon(i_1 + i_2)$.
In Level 1, in addition to the previous possible errors,
$\epsilon(i_1 + i_2) + \epsilon(i_3 + i_4)$ stream items may be incorrectly registered at the wrong
side of the query start time. A recursive repetition for $h$ levels
results in $err_2 \le h\epsilon i$, where $i = \sum_j i_j$. The total absolute error
(including $err_1$) then becomes $err = err_2 + err_1 \le h\epsilon i + \epsilon(i + h\epsilon i)$,
resulting in a maximum relative error of $h\epsilon(1 + \epsilon) + \epsilon$.

In many applications, the number of aggregation levels can be
predicted, or even controlled when constructing the network topology.
For example, consider DHT-based or hierarchical P2P topologies,
which typically enable a balanced-tree access to the peers of
height $h = \log(N)$, where $N$ is the number of nodes. In such systems,
initializing the individual exponential histograms with error
$\frac{\sqrt{1 + 2h + h^2 + 4h\epsilon} - 1 - h}{2h}$ yields an aggregated exponential histogram
of relative error $\epsilon$. Naturally, this causes a slight inflation of the
size of the sliding window, by $O(\log(N))$. However, even with this
inflation, exponential histograms are - even for extremely large
networks - substantially smaller and more efficient than randomized
data structures that enable error-free aggregation at the expense of
memory proportional to $O(1/\epsilon^2)$ (see also Section 5.2).
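
The per-node error follows from solving $h\epsilon'(1 + \epsilon') + \epsilon' = \epsilon$ for $\epsilon'$. The short Python sketch below evaluates this expression and checks it against the multi-level bound; the function names and the example values are illustrative assumptions.

```python
import math


def per_node_error(eps_target: float, h: int) -> float:
    """Error to configure at each node so that h aggregation levels yield eps_target."""
    return (math.sqrt(1 + 2 * h + h * h + 4 * h * eps_target) - 1 - h) / (2 * h)


def aggregated_error(eps_local: float, h: int) -> float:
    """Maximum relative error after h levels: h * eps * (1 + eps) + eps."""
    return h * eps_local * (1 + eps_local) + eps_local


# Hypothetical example: target error 0.1 over a hierarchy of height 5.
eps_local = per_node_error(0.1, 5)
print(eps_local, aggregated_error(eps_local, 5))
```

For small $\epsilon$ the result is close to $\epsilon/(h + 1)$, which matches the intuition that the error budget is shared across the levels.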

Deterministic Waves. The aggregation technique extends trivially
to deterministic waves. Recall that each wave is composed of $l$
levels, each covering a different range. To perform the aggregation,
we start from the lowest level $l - 1$, and switch to a higher level
every $(1/\epsilon + 1)/2$ bits, i.e., when the first entry in the higher level
has arrived before the next entry in the current level. Repeating the
calculation of the error bounds for the aggregation of deterministic
waves becomes straightforward when we notice that invariant 1 of
the exponential histograms is also true for deterministic waves.

Count-based Exponential Histograms. Although exponential his- 
tograms cover both time-based and count-based sliding windows, 
aggregation of exponential histograms is specific for time-based 
sliding windows. Count-based sliding windows do not contain suf- 
ficient information for allowing order-preserving aggregation. Even 
storing the system-wide time of the buckets would not be sufficient 
to allow such an aggregation. To illustrate this limitation, consider
the two count-based exponential histograms depicted in Fig. 2. For
each bucket we store the bucket id, the size of the bucket, the bucket 
completion time and the total number of arrivals until that time. An 
arrival in count-based sliding windows might be a true or a false 
bit. An example query can then be: how many true bits arrived in 
the last 100 system-wide arrivals. If these 100 system- wide arrivals 
were read between time 19 and 20, then the correct answer would 
be 1. However, it is also possible that the last 100 system- wide ar- 
rivals have arrived between time 3 and time 20, in which case the 
correct answer could be anything between 2 and 9. The information 
contained in the two exponential histograms is not sufficient to es- 
timate this type of queries, as it only allows us to preserve the order 
of the true bits, but looses the order of the false bits, which is also 
important. Therefore, given only the exponential histograms, it is 
not possible to aggregate them in a way that preserves the ordering 
of both true and false bits. Deterministic and randomized waves 
also have the same limitation when it comes to order-preserving 
aggregation of count-based sliding windows. 

5.2 Aggregation of Randomized Waves 

Randomized waves were proposed in [15] to address the problem
of distributed union counting: counting the number of 1's in
the position-wise union of $t$ distributed data streams, over a sliding
window. However, the existing algorithm for utilizing more
than one randomized wave does not consider aggregation of several
waves to generate a single wave. It assumes that the individual
randomized waves can be stored and accessed any time, which is 
inconvenient for large networks. To eliminate this assumption we 
now propose a slight variation of their algorithm that can produce 
a single randomized wave out of a set of individual waves, with the 
same probabilistic accuracy guarantees as the individual waves. 

Our algorithm simulates the construction of the aggregate
randomized wave $RW_\oplus$ by using only the information included in
the individual randomized waves. Consider a set $\mathcal{R}$ of randomized
waves $RW_1, RW_2, \ldots, RW_n$, configured to store a sliding window
of $N$ time units, with error parameters $\epsilon$ and $\delta$. The aggregate
randomized wave $RW_\oplus$ is initialized with the same $\epsilon$ and $\delta$
parameters, for storing a maximum of $\sum_{i=1}^{n} |RW_i|$ events over $N$
time units. Each level $l$ of $RW_\oplus$ is then constructed by concatenating
the corresponding level $l$ from all individual randomized waves,
sorting all events based on the timestamp, and keeping the last $c/\epsilon^2$
events. Recall that the number of levels of individual randomized
waves is determined based on the maximum number of events in
the sliding window. Therefore, it may happen that $RW_\oplus$ has more
levels than individual randomized waves. To populate the lower
levels of $RW_\oplus$, we rehash the events populating the last level of
each individual randomized wave, as proposed in [15] when merging
different levels from randomized waves.

The process of query execution and the accuracy guarantees re- 
main the same as for the standard randomized waves. 
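
The per-level construction can be sketched as below. This is an illustration only: each wave is modelled as a list of levels, each level as a timestamp-sorted list of events, the constant `c` is left as a parameter, and the rehashing step that populates additional levels of the aggregate wave is deliberately omitted. All names are hypothetical.

```python
import heapq
import math


def merge_level(level_lists, eps, c):
    """Merge one level from several waves and keep the last ceil(c / eps^2) events.

    level_lists: list of event lists, each already sorted by timestamp.
    """
    capacity = math.ceil(c / eps ** 2)
    merged = list(heapq.merge(*level_lists))  # order-preserving merge by timestamp
    return merged[-capacity:]


def merge_waves(waves, eps, c):
    """Merge waves level by level; waves may have different numbers of levels."""
    num_levels = max(len(wave) for wave in waves)
    return [
        merge_level([wave[lvl] for wave in waves if lvl < len(wave)], eps, c)
        for lvl in range(num_levels)
    ]
```

Because every level is truncated to the same capacity as in an individual wave, the aggregate has the same size bound as its inputs.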

5.3 Composability of ECM-Sketches

Consider a set of ECM-sketches $CM_1, CM_2, \ldots, CM_n$ with
identical dimensions and hash functions. The ECM-sketch $CM_\oplus$
with each counter set to the sum of all corresponding counters from
the individual sketches (as defined by the $\oplus$ operator), summarizes
the information found in the individual sketches:

$CM_\oplus[j, k] = CM_1[j, k] \oplus CM_2[j, k] \oplus \ldots \oplus CM_n[j, k]$

To bound the estimation error, we consider the two sources of
error in the aggregated ECM-sketch. The error due to the Count-Min
sketch, $\epsilon_{cm}$, does not change, since it only depends on the
dimensionality of the Count-Min array, which is fixed. However,
the error due to sliding window estimations at each counter might
change with each aggregation. Let $\epsilon'_{sw}$ denote the error produced
by the aggregation of the corresponding Count-Min counters, as
discussed in Sections 5.1 and 5.2. Recall that this error depends on
the data structure used for maintaining the sliding window. Similar
to the case of individual ECM-sketches, the total error is
$\epsilon = \epsilon_{cm} + \epsilon'_{sw} + \epsilon_{cm}\epsilon'_{sw}$, with probability $1 - \delta_{sw} - \delta_{cm}$.

Figure 2: An example of why aggregating count-based exponential
histograms is not possible.
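
The cell-wise composition and the resulting error can be sketched as follows. The sketch treats an ECM-sketch as a plain two-dimensional grid of counters and takes the counter aggregation as a function argument, so that any of the sliding-window aggregations discussed earlier could be plugged in. The function names are hypothetical, and the test values use plain integers as stand-in counters.

```python
def compose(sketches, merge_counters):
    """Combine ECM-sketches with identical dimensions, cell by cell.

    sketches: list of d x w grids (lists of rows) of sliding-window counters.
    merge_counters: function that aggregates a list of counters into one counter.
    """
    depth, width = len(sketches[0]), len(sketches[0][0])
    if any(len(s) != depth or any(len(row) != width for row in s) for s in sketches):
        raise ValueError("all sketches must have identical dimensions")
    return [
        [merge_counters([s[j][k] for s in sketches]) for k in range(width)]
        for j in range(depth)
    ]


def total_error(eps_cm, eps_sw_aggregated):
    """Total error eps_cm + eps'_sw + eps_cm * eps'_sw of the composed sketch."""
    return eps_cm + eps_sw_aggregated + eps_cm * eps_sw_aggregated


# With eps = eps' = 0.05 per histogram, the aggregated window error is 2*0.05 + 0.05**2.
print(total_error(0.05, 2 * 0.05 + 0.05 ** 2))
```

All counters at position $[j, k]$ are merged in a single step rather than pairwise, since each additional pairwise aggregation would add another error term.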

6. OTHER APPLICATIONS 

In addition to point and inner product queries, ECM-sketches can 
also address more complex requirements. We now briefly discuss 
two such cases: (a) finding the frequent items, and, (b) continu- 
ous monitoring of the value of inner joins or point queries over 
distributed streams. Additional problems, such as computing quan- 
tiles or answering range queries over sliding windows, can also be 
addressed, e.g., by adapting the algorithms proposed for Count-Min 
sketches [10] to employ ECM-sketches instead. 

6.1 Finding the Frequent Items 

Consider a stream S containing items from the universe U. The 
straightforward solution for finding the frequent items in the slid- 
ing window is to execute |U| point queries on the ECM-sketch, one 
for each item in the universe, and retain only the items above the 
desired frequency threshold. However, this approach carries a
computational complexity of $O(|U| \times \ln(1/\delta))$ for executing all queries
and detecting the frequent items, which is clearly prohibitive for 
streaming algorithms. 

A more efficient algorithm based on range sums is proposed by 
Cormode et al. [10], and can be adapted to ECM-sketches for ad- 
dressing the sliding-window requirements. The algorithm relies on 
group testing, for progressively reducing the domain of candidate 
frequent items, until only the truly frequent items remain. The basic 
idea is to create $\log(|U|)$ ECM-sketches, denoted as
$CM_0, CM_1, \ldots, CM_{\log(|U|)-1}$, to keep the number of occurrences of ranges of
items. The $i$'th ECM-sketch is used to maintain the range sum of the
necessary dyadic ranges of length $2^i$ for covering $U$. A new arrival
$x \in U$ is handled by adding $\lfloor x/2^i \rfloor$ to $CM_i$, for $0 \le i < \log(|U|)$.
To detect the frequent items, we start with $CM_{\log(|U|)-1}$, estimating
the number of occurrences of the contained dyadic ranges. If
any of the dyadic ranges has an estimated frequency less than the
frequency threshold $\phi$, the whole dyadic range is ignored, as it cannot
contain a frequent item. For all ranges with frequency surpassing
$\phi$, the test continues recursively by breaking the range in two,
and using the ECM-sketch of the lower level.
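
The dyadic search can be sketched in Python as follows. To keep the example self-contained, exact `Counter` objects stand in for the ECM-sketches of each level, so the sketch shows only the group-testing control flow and not the approximation behaviour; the function names and the toy stream are assumptions made for illustration.

```python
from collections import Counter


def build_levels(stream, log_u):
    """levels[i] counts occurrences per dyadic range of length 2**i (prefix x >> i)."""
    levels = [Counter() for _ in range(log_u)]
    for x in stream:
        for i in range(log_u):
            levels[i][x >> i] += 1
    return levels


def frequent_items(levels, threshold):
    """Descend from the coarsest level, expanding only ranges that reach the threshold."""
    candidates = [0, 1]  # the two ranges of the coarsest level
    for i in range(len(levels) - 1, -1, -1):
        survivors = [p for p in candidates if levels[i][p] >= threshold]
        if i == 0:
            return sorted(survivors)
        # Split each surviving range into its two halves at the next finer level.
        candidates = [child for p in survivors for child in (2 * p, 2 * p + 1)]
    return []


# Toy stream over the universe {0, ..., 15}.
stream = [3] * 5 + [9] * 4 + [1, 2, 4, 7, 12]
print(frequent_items(build_levels(stream, log_u=4), threshold=4))
```

Only the ranges that pass the test are expanded, so the number of point queries depends on the number of frequent ranges per level rather than on $|U|$.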

There are some interesting variants of the above problem, mostly 
relating to the way the threshold $\phi$ is expressed by the user. If $\phi$ is
given as a minimum number of occurrences of each item, then no
further computation is needed to determine which dyadic ranges
are frequent and which are infrequent. However, it is often useful
to express $\phi$ as the ratio of the number of occurrences of each item
to the total number of arrivals within the sliding window. For time- 
based sliding windows, we can estimate the total number of arrivals 
by maintaining an additional sliding window, e.g., a deterministic 
wave, and using its lower bound. A better alternative that does
not require additional memory is to use ECM-sketch $CM_0$ to estimate
the total number of arrivals, by summing all counters in each
row, and getting the average value. Although this approach has the
same error bounds, in practice it offers better estimation accuracy
than maintaining a single additional sliding window, since the errors
coming from all counters in each row are usually canceled out.

Figure 3: Local constraints using the Geometric Approach.
Each node constructs a sphere whose diameter is the segment between
the drift vector u of the node and the estimate vector e. The global
statistics vector v is guaranteed to lie in the convex hull of e, u1, u2,
u3, u4. The union of the local spheres covers the convex hull.

This estimation based on ECM-sketches may result in false
positives and false negatives. Theorem 5 allows us to bound this error.

THEOREM 5. The proposed algorithm uses
$O((\log|U|/\epsilon) \log(2\log|U|/(\delta\phi)) \log^2(g(N, S)))$ memory and
amortized time $O(\log(2\log|U|/\delta) \log|U|)$ per update, for detecting
every item with frequency at least $(\phi + \epsilon)\|a\|_1$. With probability
$1 - \delta$, no item with frequency less than $\phi\|a\|_1$ is output.

The same algorithm for approximating range sums can also be 
used for range queries, by noticing that all valid ranges within 
U can be expressed by a sum of dyadic ranges [10]. The error 
guarantees in this case are identical to the ones for Count-Min 
sketches, as described in [10], whereas the memory requirements 
are $O((1/\epsilon) \log(1/\delta) \log^2(g(N, S)) \log|U|)$ bytes, for maintaining
the $\log|U|$ ECM-sketches.

6.2 Continuous Monitoring of Functions for 
Threshold Crossing 

In many application domains, continuous monitoring of func- 
tions is required. ECM-sketches can also be used in these scenar- 
ios to reduce the memory and network requirements. We give the 
main intuition on how this can be done using self-join queries over 
sliding windows as an example. 

We combine ECM-sketches with the geometric method [25]. The 
geometric method allows the distributed monitoring of complex 
(non-linear) functions defined over the average of local vectors 
(termed local statistics vectors) maintained at sites. The goal
is to drastically reduce the required coordination for monitoring
threshold crossing of such complex functions in a distributed network.
The main idea is to distributively monitor the domain space
where the average vector may lie. Each site monitors a portion
of the corresponding subset of the domain space, with the
corresponding monitoring zone often being expressed as a hypersphere.
A common reference point of all such hyperspheres is the global
estimate vector, which is the average vector computed during the
last global communication (often called a synchronization step)
among all sites. Figure 3 depicts this process.

In this context, ECM-sketches are used to represent: 

• The local statistics vectors at each site. The ECM-sketches are
denoted as $sv_1(t), sv_2(t), \ldots, sv_n(t)$, where $n$ is the number
of sites. All sketches have an identical configuration.

• The global statistics vector. This vector is the current average
over all local statistics vectors. The value of this vector is unknown
to all sites, unless a synchronization takes place. The
global statistics sketch is denoted as $sv(t)$ and is computed by
a linear aggregation of the local statistics sketches. We also
use $se(t)$ to denote the global estimate vector, which is the last
known value of the global statistics vector.

Out of these two ECM-sketches, we can also compute the fol- 
lowing two vectors, required by the geometric method: 

• The statistics delta vectors, denoted using $\Delta sv_i(t)$. This vector
is equal to the difference between the local statistics vector
and the corresponding vector that was transmitted in the last
synchronization.

• The drift vectors, denoted as $su_i(t)$, where $su_i(t) = se(t) + \Delta sv_i(t)$.
The global statistics vector is guaranteed to lie in the
convex hull of the drift vectors, while this convex hull is covered
by the union of hyperspheres monitored by the sites. The
hypersphere of each site is constructed so that the segment between
the global estimate vector and the corresponding drift vector of the
site is its diameter [25].

To initialize the monitoring process, all nodes send their local 
statistics vectors $sv_1(t), sv_2(t), \ldots, sv_n(t)$ to a coordinator. The
coordinator aggregates all vectors using the algorithm for order- 
preserving aggregation of ECM-sketches, and computes a single 
global statistics vector sv(t). This global statistics vector is called 
the global estimate vector, and it is propagated to all network nodes, 
e.g., by using a hierarchy, or a broadcasting technique. This es- 
timate vector is used by each participating node to extract a set 
of Count-Min sketches, one for each query range. Without loss 
of generality, assume that we have only a single query range, and 
se(t) denotes the corresponding extracted Count-Min sketch. 

After each new arrival at time $t'$, node $p_i$ updates its local statistics
vector $sv_i$, and checks for a local constraint violation. For this
check, $p_i$ extracts the statistics delta vector $\Delta sv_i(t')$ from $sv_i(t')$
as a Count-Min sketch, by querying each counter of $sv_i(t')$ for its
value within the time range $(t, t']$. By summing $\Delta sv_i(t')$ with $se(t)$,
the node can compute the drift vector $su_i(t')$, again as a Count-Min
sketch, and construct the sphere of the geometric method. The
sphere is formed with center $k = (se(t) + su_i(t'))/2$ and radius
$a = \|se(t) - su_i(t')\|/2$. The geometric method guarantees
that if the maximum and minimum value of the function within the
sphere are on the same side of the threshold, then there can be no
threshold crossing caused by this update. For computing the max- 
imum and minimum value of the function efficiently, we currently 
have closed form equations for simple functions, like self-joins. 
Sharfman et al. [25] propose using numerical analysis algorithms, 
to compute these extrema, e.g., with Matlab. We are still work- 
ing on this problem, to achieve efficient analytic solutions for more 
function types. 
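
The local test can be sketched for one simple case. The example below assumes the monitored function is $f(v) = \|v\|^2$ over a flat vector (e.g., a single row of the extracted Count-Min sketch, whose squared norm is that row's self-join estimate), for which the extrema over a sphere have a closed form. Handling the minimum over all rows, as well as the function and variable names, are simplifications and assumptions of this sketch.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def self_join_bounds(estimate, drift):
    """Min and max of ||v||^2 over the sphere whose diameter joins estimate and drift."""
    center = [(e + u) / 2 for e, u in zip(estimate, drift)]
    radius = norm([e - u for e, u in zip(estimate, drift)]) / 2
    c = norm(center)
    return max(0.0, c - radius) ** 2, (c + radius) ** 2


def may_cross(estimate, delta, threshold):
    """True if the local sphere is not entirely on one side of the threshold."""
    drift = [e + d for e, d in zip(estimate, delta)]  # drift = estimate + delta
    low, high = self_join_bounds(estimate, drift)
    return low <= threshold <= high


print(may_cross([3.0, 4.0], [0.0, 0.0], 30.0))  # no local change, sphere is a point
print(may_cross([3.0, 4.0], [3.0, 4.0], 30.0))  # sphere spans the threshold
```

A node would only need to contact the coordinator when `may_cross` returns true; otherwise the update cannot have caused a threshold crossing.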

7. EXPERIMENTAL EVALUATION 

Our experiments focused on evaluating ECM-sketches with re- 
spect to their scalability, effectiveness, and efficiency, as well as 
their suitability for distributed setups. The experiments were con- 
ducted using two frequently used real-life data sets, the world- 
cup'98 [2] (wc'98) and the Crawdad SNMP Fall 03/04 data set [21] 
(snmp). The wc'98 data set consists of all HTTP requests that 
were directed within a period of 92 days to the web-servers host- 
ing the official world-cup 1998 website. It contains a total of 1.089 
billion valid requests, served by 33 server mirrors. Each request 
was indexed using the web-page url as a key, i.e., the ECM-sketch 
could be used for estimating the popularity of each web-page. The 
snmp data set contains a total of 134 million records collected 
from the wireless network of Dartmouth college during the fall 
of 2003/2004. For this data set, we have used the (anonymised)
MAC addresses of the clients as keys for indexing. Therefore,
the ECM-sketch enabled estimating the traffic volume generated 
by each user. 

We have compared three sketch variants, differing in the
employed sliding window algorithm: (a) the default variant described
earlier, which is based on exponential histograms, denoted
as ECM-EH, (b) a variant using deterministic waves (ECM-DW),
and (c) a variant based on randomized waves (ECM-RW). The
comparison between the variants was performed to demonstrate the
influence of the sliding window algorithm on the performance of
ECM-sketches.

7.1 Implementation Details 

ECM-sketches were implemented in Java 1.7 using 32-bit ad- 
dressing, and executed on a single idle core of an Intel Xeon 1.6 
GHz machine. Deterministic and randomized waves were imple- 
mented as described in [15], including all optimizations. The queues 
were implemented as fixed-size deques. The waves were initialized 
using one event per millisecond as an upper bound for the num- 
ber of arrivals within the sliding window. In practice, it is rarely 
possible to predict the maximum number of events per sliding win- 
dow, and therefore conservative estimates, like this one, are often 
the only option. Concerning exponential histograms, [12] does not 
provide sufficient details for the implementation of the list of buck- 
ets. We therefore considered different possibilities for maintaining 
the buckets, including fixed arrays, deques, doubly-linked lists, and 
tree lists, and their combinations. The most efficient implementa- 
tion was a combination of fixed arrays with deques, which enabled 
random access to buckets and constant-time bucket merges. Specifically,
the bucket list was divided into different levels $L_0, L_1, \ldots, L_l$.
Each level $L_i$ was initialized as a fixed-length deque, for storing
only the buckets of size $2^i$. Furthermore, to save memory, all levels
were initially set to null, and initialized on request. The space and 
computational complexity of our implementation is as described in 
Section 6, for the random-access model. 

Unless otherwise noted, all ECM-sketches were set to monitor a 
sliding window of 1 million seconds (11.5 days). Queries were generated
with an exponentially increasing range, i.e., query $q_i$ covered
the range $[t - 10^i, t]$, with $t$ denoting the time of the last arrival. For
each range, a self-join query, as well as a set of point queries were 
constructed and executed. For thorough evaluation, we constructed 
one point query for each distinct item in the query range (i.e., es- 
timating the popularity of each web-page in the wc'98 dataset, or 
the number of snmp messages generated by each MAC address in 
the snmp dataset). 

7.2 Centralized Setup 

In the centralized scenario, a single node monitors the whole 
stream and maintains an ECM-sketch, which is subsequently used 
for answering the queries. We first consider the tradeoff between 
memory requirements and estimation error. For this, we vary $\epsilon$
within the range of [0.05, 0.25], keeping $\delta = 0.1$. For each $\epsilon$ value,
we use the analysis presented in Section 4 to configure the ECM-sketch
such that the required memory for the targeted query type is
minimized - hence the difference in the cost of point queries and
self-join queries for the same $\epsilon$ values.

Figures 4(a)-(d) plot the average and maximum observed error
in relation to the required memory for the two data sets. The
figures are annotated with indicative $\epsilon$ values. The displayed error
on the Y axis is relative to the number of events arriving within the
query range, i.e., for point queries, $err = |\hat{f}(x, r) - f(x, r)| / \|a_r\|_1$,
and for self-joins, $err = |\widehat{a_r \odot a_r} - a_r \odot a_r| / (\|a_r\|_1)^2$. Recall
that the ECM-RW structure does not allow probabilistic guarantees
for self-join queries, and is therefore not considered for this type
of queries. Table 3 presents sample update rates for the considered
variants, for $\epsilon = 0.1$.

Our first observation is that, for all variants, both the average and
maximum observed errors are lower than the user-selected value
$\epsilon$. However, the memory requirements of ECM-RW are at least an
order of magnitude higher than the requirements of ECM-sketches
based on the two deterministic structures for offering the same
accuracy guarantees. As an example, for the wc'98 experiment with
a moderate value of $\epsilon = 0.1$, the cost of maintaining the
ECM-RW sketch is already 400 Mbytes, whereas the ECM-sketches based on
exponential histograms and deterministic waves require less than
a megabyte for satisfying the same guarantees (the simulation of
ECM-RW configured with $\epsilon = 0.05$ could not be completed due
to insufficient main memory). This happens because the memory
requirements of randomized waves grow quadratically with
$1/\epsilon$, whereas the two deterministic sliding window algorithms
scale linearly. Note that this negative result applies to all known
randomized sliding window algorithms, e.g., [27, 11], since they all
scale quadratically with $1/\epsilon$. As such, ECM-sketches based on
deterministic structures are more applicable for scenarios with
non-specialized hardware, or hardware with less memory, like sensor
networks and network devices. Comparing the two deterministic
methods, we see that ECM-EH sketches are faster and more compact,
requiring approximately half the space compared to the ones based
on deterministic waves. All results are consistent for both data sets.
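To make the scaling argument concrete, the following minimal sketch compares how a cost proportional to $1/\epsilon$ and one proportional to $1/\epsilon^2$ grow when $\epsilon$ is tightened. The function name is hypothetical, and constant and logarithmic factors are ignored, so the output illustrates growth rates only and not the measured memory of any variant.

```python
def relative_growth(eps_from: float, eps_to: float, exponent: int) -> float:
    """Growth factor of a cost proportional to (1/eps)**exponent
    when the error parameter changes from eps_from to eps_to."""
    if eps_from <= 0 or eps_to <= 0:
        raise ValueError("error parameters must be positive")
    return (eps_from / eps_to) ** exponent


if __name__ == "__main__":
    # Tightening eps from 0.25 towards 0.05, the range used in the experiments.
    for eps_to in (0.1, 0.05):
        linear = relative_growth(0.25, eps_to, exponent=1)     # deterministic
        quadratic = relative_growth(0.25, eps_to, exponent=2)  # randomized
        print(f"eps 0.25 -> {eps_to}: linear x{linear:.1f}, quadratic x{quadratic:.1f}")
```

Under this simplification, each halving of $\epsilon$ doubles the linear cost but quadruples the quadratic one, so the gap between the two families widens as the accuracy requirement tightens.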

Summarizing, these results demonstrate that ECM-EH sketches 
are more efficient and compact compared to the other two variants, 
and that ECM-RW sketches require at least an order of magnitude 
more memory to satisfy the accuracy guarantees compared to the 
two variants based on deterministic sliding window structures. 

7.3 Distributed Setup 

The second series of experiments focused on evaluating the
applicability of ECM-sketches for distributed setups. For this, we
conducted simulations of distributed networks using the real-world
distributions obtained from the two data sets. In particular, wc'98
contains the server identification for each of the 33 official
world-cup servers answering the HTTP requests, whereas the records in
the snmp data set contain the identification for each of the 535
monitored APs. For our simulations, these servers were organized in an
architecture resembling a balanced binary tree of height
$\lceil \log_2(n) \rceil$, where $n$ is the number of servers. All
servers resided at the leaf nodes of the tree. Some of these servers
were also randomly chosen to occupy the internal tree nodes,
responsible for aggregation of the ECM-sketches coming from the
children nodes. At the end of the aggregation process, the root node
of the hierarchy was holding a single ECM-sketch, representing the
order-preserving aggregation of the $n$ streams generated in
$\lceil \log_2(n) \rceil - 1$ steps. ECM-DW sketches are not
considered in this set of experiments, since they do not offer any
advantages compared to ECM-EH sketches.
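As a small worked example of the tree dimensions described above, the sketch below evaluates $\lceil \log_2(n) \rceil$ for the two data sets. The helper name is an assumption, and integer arithmetic is used instead of floating-point logarithms to avoid rounding issues near powers of two.

```python
def tree_height(n: int) -> int:
    """ceil(log2(n)) for n >= 1, computed with integer arithmetic."""
    if n < 1:
        raise ValueError("the network needs at least one server")
    return (n - 1).bit_length()


if __name__ == "__main__":
    for name, n in (("wc'98", 33), ("snmp", 535)):
        print(f"{name}: n = {n}, height = {tree_height(n)}")
```

For the 33 wc'98 servers this gives a height of 6, and for the 535 snmp access points a height of 10.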

Figures 5(a)-(b) plot the average observed error for point and
self-join queries in correlation to the network requirements for the
whole aggregation to be completed. The results correspond
to $\epsilon \in [0.05, 0.25]$ and $\delta = 0.1$. Note that the
simulation with ECM-RW sketches did not complete for all $\epsilon$
values, due to insufficient memory resources at the machine simulating
the $n$ nodes. To illustrate the accuracy loss due to this
aggregation, Table 4 presents a comparison between the observed error
of the centralized and the distributed ECM-sketches.

As expected, the process of iterative aggregations causes an
increase of the observed error for ECM-EH sketches. This error,
however, is still substantially lower than the upper bound derived by
the analysis. For example, for the case of the wc'98 data set with
$\epsilon = 0.1$, the error bound is 0.3, whereas the average observed
error after aggregation is less than 0.015, i.e., the increase due to
the aggregation is less than 1/4 of the experimentally derived error of
the centralized sketch. Concerning ECM-RW sketches, there is no
systematic variation of the error, since randomized waves enable a
lossless aggregation at the expense of a larger memory footprint.
However, the network transfer volume required for performing this
aggregation using ECM-RW is higher by at least an order of magnitude
compared to the transfer volume for the variant with exponential
histograms. This requirement is prohibitive for a large set of
application scenarios, like sensor and mobile networks, where high
network usage causes battery drainage.

Table 4: Observed error - loss is due to the iterative aggregation.

Figure 5: Observed error in correlation to the network cost, for
varying $\epsilon$: (a) wc'98 data set, (b) snmp data set.

To further explore the influence of the network size on the
estimation accuracy and network cost, we have also simulated an
artificial network of $i$ servers, with
$i \in \{1, 2, 4, \ldots, 256\}$. The nodes were again placed as leaf
nodes on a balanced binary tree, and the requests were divided
uniformly across them. Figure 6(a) and (c) plot the average observed
error in correlation to the network size, for
$\epsilon = \delta = 0.1$. As expected, for ECM-EH sketches,
increasing the number of nodes leads to a small increase in the
observed estimation error. On the other hand, the aggregation process
does not affect the accuracy of ECM-RW sketches, due to the lossless
aggregation of randomized waves. However, the network cost for
aggregating the sketches based on randomized waves (Figure 6(b) and
(d)) is at least an order of magnitude higher compared to ECM-EH.
This limits the applicability of ECM-sketches based on randomized
waves to cases where a fast, fixed network is available, and makes
the ability to merge deterministic sliding windows, e.g., based on
exponential histograms, a very important contribution of this work.

Summarizing, this set of experiments showed that ECM-sketches
based on exponential histograms can be aggregated with very small
information loss. Compared to the lossless aggregation of
ECM-sketches based on randomized waves, the sketches based on
exponential histograms are substantially more compact, and are
therefore applicable for a wider range of application scenarios, where
network cost and memory are of the essence, such as P2P networks,
sensor networks, and communication between network routers.



8. CONCLUSIONS 

In this work we considered the problem of answering complex
queries over distributed and high-dimensional data streams, in the
sliding-window model. Our proposal, ECM-sketches, is a compact
structure combining the state-of-the-art sketching technique
for data stream summarization with deterministic sliding window
synopses. The structure provides probabilistic accuracy guarantees
for the quality of the estimation, for point queries and self-join
queries, and can be applied to a broad range of problems, such as
finding heavy hitters, computing quantiles, and answering range
queries over sliding windows.

Focusing on distributed applications, we also showed how a set
of ECM-sketches, each one representing an individual stream, can
be aggregated to generate a single ECM-sketch that summarizes
the stream produced by the order-sensitive aggregation of all
individual streams. Interestingly, this is the first result in the
literature enabling such aggregation for sketches that use
deterministic sliding window synopses, and it is of high importance
since deterministic synopses are generally a factor of
$O(1/\epsilon)$ more compact than the best-known randomized synopsis
for delivering an $\epsilon$-accurate approximation. In the same
context, we demonstrated how ECM-sketches can be exploited for
detecting frequent items, as well as within the geometric method for
answering continuous queries.

ECM-sketches were thoroughly evaluated with a set of extensive 
experiments, using two large real-world datasets, and considering 
both centralized and distributed setups. The results verified the 
high performance of the structure. Compared to structures based 
on randomized sliding window synopses, ECM-sketches improve 
the memory and computational complexity by at least one order of 
magnitude. The same magnitude of improvement is observed with 
respect to the network requirements. 

Our future work includes further investigation on employing 
ECM-sketches for the geometric method, for handling additional 
types of continuous queries over distributed sliding window streams. 

Acknowledgments. This work was supported by the European
Commission under ICT-FP7-LIFT-255951 (Local Inference in
Massively Distributed Systems).
APPENDIX

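PROOF OF THEOREM 2. We consider the estimation derived by
any single row $j$ of the ECM-sketch. We first check the case of
$E((\widehat{a_r \odot b_r})_j) \ge a_r \odot b_r$:

$$
\begin{aligned}
E\big((\widehat{a_r \odot b_r})_j - a_r \odot b_r\big)
&= \sum_{x \in \mathcal{D}} \hat{f}_a(x,r)\,\hat{f}_b(x,r)
 + \sum_{\substack{p,q \in \mathcal{D},\, p \ne q \\ h_j(p) = h_j(q)}} \hat{f}_a(p,r)\,\hat{f}_b(q,r)
 - \sum_{x \in \mathcal{D}} f_a(x,r)\,f_b(x,r) \\
&\le \sum_{x \in \mathcal{D}} f_a(x,r)\,f_b(x,r)\,(1 + \epsilon_{sw})^2
 + \sum_{\substack{p,q \in \mathcal{D},\, p \ne q \\ h_j(p) = h_j(q)}} f_a(p,r)\,f_b(q,r)\,(1 + \epsilon_{sw})^2
 - \sum_{x \in \mathcal{D}} f_a(x,r)\,f_b(x,r) \\
&= (\epsilon_{sw}^2 + 2\epsilon_{sw})\, a_r \odot b_r
 + \sum_{\substack{p,q \in \mathcal{D},\, p \ne q \\ h_j(p) = h_j(q)}} f_a(p,r)\,f_b(q,r)\,(1 + \epsilon_{sw})^2
 \qquad (1)
\end{aligned}
$$

From [10], we know that

$$
E\Big(\sum_{\substack{p,q \in \mathcal{D},\, p \ne q \\ h_j(p) = h_j(q)}} f_a(p,r)\,f_b(q,r)\Big)
\le \epsilon_{cm} \|a_r\|_1 \|b_r\|_1 / e.
$$

Furthermore, by Markov inequality,

$$
\Pr\Big[\forall j : \sum_{\substack{p,q \in \mathcal{D},\, p \ne q \\ h_j(p) = h_j(q)}} f_a(p,r)\,f_b(q,r)
> \epsilon_{cm} \|a_r\|_1 \|b_r\|_1\Big] \le e^{-d} \le \delta.
$$

Combining this with Eqn. 1, we get that with probability at least
$1 - \delta$,

$$
\widehat{a_r \odot b_r} - a_r \odot b_r
\le (\epsilon_{sw}^2 + 2\epsilon_{sw})\, a_r \odot b_r
 + \epsilon_{cm} (1 + \epsilon_{sw})^2 \|a_r\|_1 \|b_r\|_1.
$$

Repeating the analysis for the case of
$E((\widehat{a_r \odot b_r})_j) < a_r \odot b_r$
we get the following probabilistic guarantees:

$$
a_r \odot b_r - \widehat{a_r \odot b_r}
\le (\epsilon_{sw}^2 + 2\epsilon_{sw})\, a_r \odot b_r.
$$

The bounds follow directly by noticing that
$a_r \odot b_r \le \|a_r\|_1 \|b_r\|_1$. □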
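The upper-side bound derived in this proof can be evaluated numerically with the short sketch below. The function name and the example parameter values are assumptions chosen for illustration; the returned factor multiplies $\|a_r\|_1 \|b_r\|_1$, after applying $a_r \odot b_r \le \|a_r\|_1 \|b_r\|_1$ to the first term.

```python
def self_join_error_factor(eps_sw: float, eps_cm: float) -> float:
    """Factor c such that the overestimation is at most c * ||a_r||_1 * ||b_r||_1:
    c = (eps_sw^2 + 2*eps_sw) + eps_cm * (1 + eps_sw)^2."""
    if eps_sw < 0 or eps_cm < 0:
        raise ValueError("error parameters must be non-negative")
    return (eps_sw ** 2 + 2 * eps_sw) + eps_cm * (1 + eps_sw) ** 2


if __name__ == "__main__":
    # Hypothetical split of the error budget between the two components.
    eps_sw, eps_cm = 0.05, 0.05
    factor = self_join_error_factor(eps_sw, eps_cm)
    # Algebraically the same quantity: (1 + eps_sw)^2 * (1 + eps_cm) - 1.
    compact = (1 + eps_sw) ** 2 * (1 + eps_cm) - 1
    print(f"factor = {factor:.6f}, compact form = {compact:.6f}")
```

The compact form shows that the sliding-window error enters the bound quadratically, because each product multiplies two estimated frequencies.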

PROOF OF THEOREM 3. By the estimation algorithm we know
that there exists at least one row $j$, for which
$E(h_j(x), j, r) = \hat{f}(x, r)$. Let us focus now on this row. We
initially assume that we have an accurate algorithm to maintain the
sliding window counters, i.e., errors are only due to hashing
collisions. With $R(h_j(x), j, r)$ we denote the accurate number of
bits that were added in the counter $(h_j(x), j)$, within the query
range $r$. Note that, because of hashing collisions, the value of
$R(h_j(x), j, r)$ might be greater than the real frequency of $x$,
denoted as $f(x, r)$. In fact, since the counters are assumed to be
accurate, the standard analysis introduced for count-min sketches may
be applied. Therefore,

$$
\Pr[R(h_j(x), j, r) - f(x, r) \le \epsilon_{cm} \|a_r\|_1] \ge 1 - \delta_{cm}
\;\Rightarrow\;
\Pr[R(h_j(x), j, r) \le f(x, r) + \epsilon_{cm} \|a_r\|_1] \ge 1 - \delta_{cm}.
$$

However, in practice, the sliding window algorithm may introduce
errors to the computation of $R(h_j(x), j, r)$. Since all considered
algorithms are $(\epsilon, \delta)$-approximate, we know that their
estimation $E(h_j(x), j, r)$ has the following property:
$\Pr[|E(h_j(x), j, r) - R(h_j(x), j, r)| \le \epsilon_{sw} R(h_j(x), j, r)] \ge 1 - \delta_{sw}$.

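For the case that $E(h_j(x), j, r) \ge R(h_j(x), j, r)$, we have
$\Pr[E(h_j(x), j, r) \le (1 + \epsilon_{sw}) R(h_j(x), j, r)] \ge 1 - \delta_{sw}$.
Considering the two results together, we get:

$$
\begin{aligned}
&\Pr[E(h_j(x), j, r) \le (1 + \epsilon_{sw}) R(h_j(x), j, r)] \ge 1 - \delta_{sw} \;\Rightarrow \\
&\Pr[E(h_j(x), j, r) \le (1 + \epsilon_{sw})(f(x, r) + \epsilon_{cm} \|a_r\|_1)]
 \ge 1 - \delta_{sw} - \delta_{cm} \;\Rightarrow \\
&\Pr[\hat{f}(x, r) - f(x, r) \le \epsilon_{sw} f(x, r) + \epsilon_{cm} \|a_r\|_1
 + \epsilon_{cm} \epsilon_{sw} \|a_r\|_1] \ge 1 - \delta
\end{aligned}
$$

Note that, since $f(x, r) \le \|a_r\|_1$,
$\epsilon_{sw} f(x, r) + \epsilon_{cm} \|a_r\|_1 + \epsilon_{cm} \epsilon_{sw} \|a_r\|_1
\le (\epsilon_{sw} + \epsilon_{cm} + \epsilon_{cm} \epsilon_{sw}) \|a_r\|_1$.
Therefore,

$$
\Pr[\hat{f}(x, r) - f(x, r) \le (\epsilon_{sw} + \epsilon_{cm} + \epsilon_{cm} \epsilon_{sw}) \|a_r\|_1]
\ge 1 - \delta \qquad (2)
$$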

With a similar analysis, the case of $E(h_j(x), j, r) <
R(h_j(x), j, r)$ gives a much tighter constraint:

$$
\Pr[f(x, r) - \hat{f}(x, r) \le \epsilon_{sw} f(x, r)] \ge 1 - \delta_{sw} \qquad (3)
$$

Note that the events considered by equations 2 and 3 are mutually
exclusive. The proof is completed by taking the minimum of
$\Pr[f(x, r) - \hat{f}(x, r) \le \epsilon_{sw} f(x, r)]$ and
$\Pr[\hat{f}(x, r) - f(x, r) \le (\epsilon_{sw} + \epsilon_{cm} + \epsilon_{cm} \epsilon_{sw}) \|a_r\|_1]$. □
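The point-query bound of equation 2 combines the two error parameters as $\epsilon_{sw} + \epsilon_{cm} + \epsilon_{cm}\epsilon_{sw}$. The sketch below evaluates this expression and, for a chosen $\epsilon_{sw}$, solves it for the $\epsilon_{cm}$ that meets a target total error. Both function names are assumptions, and the split shown is only one feasible choice, not the memory-optimal configuration derived in Section 4.

```python
def point_query_error_factor(eps_sw: float, eps_cm: float) -> float:
    """eps_sw + eps_cm + eps_cm * eps_sw, the factor multiplying ||a_r||_1."""
    return eps_sw + eps_cm + eps_cm * eps_sw


def eps_cm_for_target(eps: float, eps_sw: float) -> float:
    """Solve eps_sw + eps_cm * (1 + eps_sw) = eps for eps_cm."""
    if not 0 < eps_sw < eps:
        raise ValueError("need 0 < eps_sw < eps")
    return (eps - eps_sw) / (1 + eps_sw)


if __name__ == "__main__":
    eps, eps_sw = 0.1, 0.05          # hypothetical target and split
    eps_cm = eps_cm_for_target(eps, eps_sw)
    total = point_query_error_factor(eps_sw, eps_cm)
    print(f"eps_cm = {eps_cm:.6f}, combined bound = {total:.6f}")
```

Because of the cross term, the two parameters cannot simply sum to the target: with $\epsilon_{sw} = 0.05$ and a target of 0.1, the count-min parameter must be slightly below 0.05.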

PROOF OF THEOREM 4. We argue that $EH_\oplus$ approximates the
exponential histogram of the logical stream, with a maximum relative
error of $(1 + \epsilon)\epsilon' + \epsilon$, where $\epsilon$ is the
error parameter of the initial exponential histograms. Consider a
query for the last $q$ time units. With $s_q = t - q$ we denote the
query starting time. Let $Q$ denote the index of the bucket of
$EH_\oplus$ which contains $s_q$ in its range, i.e.,
$s(EH_\oplus^Q) \le s_q \le e(EH_\oplus^Q)$. With $i$ and $\hat{i}$ we
denote the accurate and estimated number of true bits in the query
range. According to the estimation algorithm, the estimation for the
number of true bits in the stream will be
$\hat{i} = \frac{1}{2}|EH_\oplus^Q| + \sum_{Y < Q} |EH_\oplus^Y|$.
This estimation may be influenced by two types of approximation
errors: (a) a possible approximation error of the overlap of bucket
$EH_\oplus^Q$ with the query range, denoted as $err_1$, and, (b) a
possible approximation error of $i$, denoted as $err_2$, because of
the inclusion of data that arrived before $s_q$ in buckets $Y < Q$, or
data that arrived after $s_q$ in buckets $Y > Q$. Let us now look into
these two errors in more detail.
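The estimation rule used in this proof, which counts all buckets more recent than bucket $Q$ in full and bucket $Q$ by half, can be sketched as follows. The `Bucket` class, its field names, and the example histogram are assumptions made for illustration; buckets are listed from most recent (index 1 in the text) to oldest.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    start: int  # arrival time of the oldest true bit in the bucket
    end: int    # arrival time of the most recent true bit in the bucket
    size: int   # number of true bits summarized by the bucket


def estimate_true_bits(buckets: List[Bucket], s_q: int) -> float:
    """Estimate the number of true bits arriving at or after s_q."""
    total = 0.0
    for bucket in buckets:
        if bucket.start > s_q:
            total += bucket.size          # index Y < Q: fully inside the range
        elif bucket.end >= s_q:
            total += bucket.size / 2      # bucket Q: straddles s_q, halved
            break
        else:
            break                         # this and all older buckets are outside
    return total


if __name__ == "__main__":
    # Hypothetical histogram with bucket sizes 1, 2, 4, 8 (most recent first).
    eh = [Bucket(95, 100, 1), Bucket(90, 94, 2), Bucket(80, 89, 4), Bucket(60, 79, 8)]
    print(estimate_true_bits(eh, s_q=85))
```

In this example the query start 85 falls inside the size-4 bucket, so the two newer buckets contribute 3 bits and the straddling bucket contributes 2; the halving is the source of $err_1$.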

With respect to $err_2$, recall that the contents of individual
buckets are inserted to $EH_\oplus$ using the starting time and the
ending time of the buckets. Therefore, it may happen that some bits
arrive before $s_q$ but are inserted to $EH_\oplus$ with a timestamp
after $s_q$, creating 'false positives'. The opposite is also
possible. These bits are called out-of-order bits with respect to
$s_q$. Clearly, out-of-order bits may lead to underestimation or
overestimation of the query answer. The following lemma allows us to
upper bound the number of out-of-order bits, and thereby control the
maximum error $err_2$.

LEMMA 1. Consider an (individual) exponential histogram
$EH_Z$ of stream $Z$, configured with error parameter $\epsilon$. The
out-of-order bits with respect to the query starting time $s_q$ that
$EH_Z$ can generate are at most $\epsilon i_Z$, with $i_Z$ denoting
the number of true bits arriving after $s_q$ in $Z$.

PROOF. Due to the non-decreasing nature of bucket timestamps,
there can be only one bucket with a start time less than $s_q$ and end
time greater than or equal to $s_q$. Let this bucket be $EH_Z^j$. All
other buckets have both starting and ending time at the same side of
$s_q$, and therefore their contents are always inserted with a
timestamp at the correct side of $s_q$ and do not create out-of-order
bits.

Since the ending time of $EH_Z^j$ is at or after $s_q$, its most
recent true bit has arrived at or after $s_q$, and should be included
in the query range. Therefore, the number of true bits arriving at or
after $s_q$ in stream $Z$ is
$i_Z \ge 1 + \sum_{b=1}^{j-1} |EH_Z^b|$. Furthermore, since half of
the bits of $EH_Z^j$ are inserted using the ending time and half
using the starting time of the bucket, the maximum number of
out-of-order bits is $|EH_Z^j|/2$. By construction (invariant 1):

$$
\frac{|EH_Z^j|}{2\left(1 + \sum_{b=1}^{j-1} |EH_Z^b|\right)} \le \epsilon
\;\Rightarrow\;
\frac{|EH_Z^j|}{2} \le \epsilon \Big(1 + \sum_{b=1}^{j-1} |EH_Z^b|\Big) \le \epsilon\, i_Z. \qquad \Box
$$

The following lemma extends this result to all exponential
histograms constituting $EH_\oplus$, for computing the total value of
$err_2$:

LEMMA 2. Consider the exponential histogram $EH_\oplus$,
constructed by aggregating exponential histograms
$EH_1, EH_2, \ldots, EH_n$. The maximum value of $err_2$ is
$\epsilon i$, with $i = \sum_{x=1}^{n} i_x$ denoting the number of
true bits that arrived in all streams during or after $s_q$.

PROOF. Let $err_2(x)$ denote the number of out-of-order bits of
stream $x$ with respect to $s_q$. Furthermore,
$j_x = \max\{b \mid e(EH_x^b) \ge s_q\}$.

Notice that $err_2(x)$ is upper-bounded by Lemma 1. Due to the
aggregation algorithm, $err_2 = \sum_{x=1}^{n} err_2(x)$. Observing
that $\epsilon$ is the same across all $EH$, we have:
$err_2 = \sum_{x=1}^{n} err_2(x) \le \sum_{x=1}^{n} \epsilon\, i_x = \epsilon\, i$. □

Underestimation or overestimation of the overlap may also happen
because of the halving of the size of bucket $EH_\oplus^Q$ during
query time ($err_1$). As shown in [12], this process may introduce a
maximum error of $\epsilon' r$, where $r$ is the sum of the sizes of
all buckets in $EH_\oplus$ with an index lower than $Q$ (i.e., with a
starting time at least equal to $s_q$). Recall that $r$ may also
include bits that arrived before $s_q$, which can however be upper
bounded by Lemma 2. Therefore, the maximum underestimation or
overestimation error is
$err_1 = \epsilon' r \le \epsilon'(i + \epsilon i) = \epsilon' i + \epsilon \epsilon' i$,
with $i = \sum_{x=1}^{n} i_x$.

Summing $err_1$ and $err_2$, we get a maximum relative error of
$(\epsilon + \epsilon' + \epsilon \epsilon')$, which equals the
$(1 + \epsilon)\epsilon' + \epsilon$ claimed at the start of the
proof. Theorem 4 follows directly. □